†Chao-Lin Liu ‡Guan-Tao Jin ↑Peter K. Bol §Qing-Feng Liu
↕Wen-Huei Cheng !Wei-Yun Chiu ¶Richard Tzong-Han Tsai ╪Yu-Chun Wang
†Department of Computer Science, National Chengchi University, Taiwan
‡§↕!Department of Chinese Literature, National Chengchi University, Taiwan
↑†Institute for Quantitative Social Science, Harvard University, USA
¶Department of Computer Science and Information Engineering, National Central University, Taiwan
╪Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
†Graduate Institute for Linguistics, National Chengchi University, Taiwan
†[email protected], ↑[email protected], ![email protected], ¶[email protected]
Abstract—We analyzed historical and literary documents in Chinese to gain insights into research issues, and we compile¹ and review our studies, which utilized four different sources of text material, in this paper. We investigated the history of concepts and transliterated words in China with the Database for the Study of Modern Chinese Thought and Literature, which contains historical documents about China published between 1830 and 1930. We also attempted to disambiguate names that were shared by multiple government officers who served between 618 and 1912 and were recorded in Chinese local gazetteers (地方志 /di4 fang1 zhi4/). To showcase the potential and challenges of computer-assisted analysis of Chinese literature, we explored thought-provoking questions about two of the Four Great Classical Novels of China: (1) which monsters attempted to consume the Buddhist monk Xuanzang in the Journey to the West (西遊記 /xi1 you2 ji4/), which was published in the 16th century, and (2) which major character smiled the most in the Dream of the Red Chamber (紅樓夢 /hong2 lou2 meng4/), which was published in the 18th century.
Keywords—digital humanities, computational linguistics, text analysis, text mining, temporal analysis, geographical analysis, keyword collocation, name disambiguation, history of concepts in China, transliterated words in Chinese historical documents
I. INTRODUCTION
The rapidly increasing availability of digitized text material about Chinese history and literature offers great opportunities for researchers to take advantage of advances in computing technologies and conduct historical and literary studies more efficiently and at a larger scale than before. As a result, Digital Humanities [5, 9] has emerged as a relatively new interdisciplinary field in recent decades. Researchers can employ techniques of information retrieval and text analysis to extract and investigate information that is relevant to specific topics in their research. With the help of these computing technologies, researchers can obtain relevant information from a much larger data source than ever before, and this data collection phase can be completed much more efficiently as well.
1 We report new results, ongoing work, and a few results that we have previously published in papers written in Chinese.
Software tools are useful not just for data collection; they can and should also facilitate preliminary data analysis, so that domain experts can spend their precious time and energy on more in-depth research, analysis, interpretation, and judgment.
Despite its relatively short presence in the research community, the idea of conducting humanistic research with digital facilities has attracted the attention and concern of leading historians and philosophers in both the Western [4] and the Chinese-speaking [11] worlds.
In this paper, instead of discussing these developmental and philosophical aspects of digital humanities, we show with four actual examples how digital facilities can support the study of historical and literary documents in Chinese. Two of these research projects were based on two different and large historical text databases, and the other two were based on two very famous classic Chinese novels.
The Database for the Study of Modern Chinese Thought and Literature (DSMCTL²) contains a wide variety of scanned documents, together with their text, about Chinese history and literature that were published between 1830 and 1930.
DSMCTL is the representative Digital Humanities project for Chinese history selected by the Department of Humanities and Social Sciences of the Ministry of Science and Technology of Taiwan. With 120 million Chinese characters in the repository, DSMCTL has provided a crucial basis for the study of the history of concepts (觀念史 /guan1 nian4 shi3/³) in modern China, and our research team has conducted a series of investigations about the establishment and variations of concepts, including “sovereignty” (主權 /zhu3 quan2/), “ism”
2 http://dsmctl.nccu.edu.tw/
3 Chinese words consist of one or more individual characters. For example, “人文” (/ren2 wen2/) is a Chinese translation of “humanities”, and “人文” is a Chinese word that includes two Chinese characters. When we show a Chinese word for the first time, we provide pronunciation information about its characters in Hanyu Pinyin, followed by their tones in digits.
In many of our research projects, we relied on the temporal analysis of keywords for concepts and their co-occurrences.
For example, to study the development of democratic concepts in China, we would search for the Chinese translation of “democracy” in historical documents. In modern text, democracy is consistently translated as “民主” (/min2 zhu3/), so it is tempting to look only for “民主” when studying democracy.
However, democracy was a new concept to Chinese people, and for some time people employed transliterated words, e.g., “德模克拉西” (/de2 mo2 ke4 la1 xi1/), to refer to it. Hence, researchers need to know this early embodiment of “democracy” in Chinese texts for their studies, and, to meet this need, we conducted research on identifying transliterated words with a special book in DSMCTL.
Difangzhi (地方志 /di4 fang1 zhi4/) is a genre of official records published by local governments in China across many dynasties. Names and relevant information about government officers could be recorded in these local gazetteers. Extracting relevant information from Difangzhi and linking the information about a particular person will help us strengthen the contents of the China Biographical Database Project (CBDB⁴) hosted by Harvard University.
To this end, we need to tackle the problem of names that were shared by multiple persons. Some names are much more popular than others. For instance, we have 29 records for 王臣 (/wang2 chen2/) and 29 records for 王佐 (/wang2 zuo3/) in our Difangzhi database. A few of these records belong to the same person, but most do not. Asking domain experts to compare and differentiate records for the same name in a collection of more than 110 thousand name records is impractical because of the time and cost involved. Hence, we first employed computer programs to identify pairs of name records that might belong to the same person or to different persons, in order to facilitate the name disambiguation task.
In addition to analyzing historical documents, we explored the applicability of text analysis tools to Chinese literature. The most famous classic novels immediately came to mind: the Romance of the Three Kingdoms (三國演義 /san1 guo2 yan3 yi4/), the Journey to the West (西遊記 /xi1 you2 ji4/), the Water Margin (水滸傳 /shui3 hu3 zhuan4/), and the Dream of the Red Chamber (紅樓夢 /hong2 lou2 meng4/). All of them have been translated into English and other languages. Using these novels as the basis for our illustrative studies makes the results easier for both domain experts and general readers to appreciate.
In this paper, we report our work with the Journey to the West (JTTW, henceforth) and the Dream of the Red Chamber (DRC, henceforth). We chose to work on two questions whose answers are not immediately obvious even to readers who have read JTTW and DRC more than once.
4 http://isites.harvard.edu/icb/icb.do?keyword=k16229
For JTTW, we wondered which monsters attempted to consume the Buddhist monk Xuanzang, arguably the most important character in JTTW. In JTTW, many believed that consuming the monk would make one immortal, so a number of monsters chased after the monk for immortality.
For DRC, we wondered who smiled most frequently among the three most important characters in the novel, i.e., 寶玉 (/bao3 yu4/), 黛玉 (/dai4 yu4/), and 寶釵 (/bao3 chai1/).
We elaborate on each of these studies in a separate section, along with discussions of the limitations of our current approaches, and wrap up this paper with concluding remarks and some future work.
II. THE DATABASE FOR THE STUDY OF MODERN CHINESE THOUGHT AND LITERATURE
The Database for the Study of Modern Chinese Thought and Literature (DSMCTL) contains more than 120 million Chinese characters. This relatively large database serves as a good resource for research, though reading all of its contents would be a formidable task for anyone.
Software tools offer two levels of assistance and prove to be instrumental for the efficiency and effectiveness in our work.
We have built tools which help historians identify and extract potentially relevant text material for further in-depth research.
We also implemented tools which allow historians to examine statistical properties of important keywords and their co-occurrences.
In a typical study, historians initiated a research problem and provided a list of relevant seed keywords for the target problem. Historical documents were then identified and extracted from DSMCTL based on these initial seed keywords.
Given this initial set of extracted documents, historians could browse them and then select the documents that were really relevant to the target problem.
We then employed computing tools to help us find very frequent (VF) words in these selected documents, and the historians could inspect the contexts of these VF words to pick a set of new keywords from them. If the historians were curious about the significance and relevance of these new keywords to the target problem, we could extract documents that contained these new keywords for the researchers to inspect. This iterative step of identifying important keywords and extracting relevant documents can be repeated as many times as needed.
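One round of the iterative procedure described above can be sketched as follows. This is a minimal illustration, not the tool used in the DSMCTL project: the function names, the use of character bigrams as a stand-in for "words", and the toy corpus are all assumptions made for the example.

```python
from collections import Counter

def extract_documents(corpus, keywords):
    """Return the documents that contain at least one of the keywords."""
    return [doc for doc in corpus if any(kw in doc for kw in keywords)]

def suggest_keywords(documents, known, n=2, top_k=3):
    """Rank character n-grams by frequency and drop those already known.

    Character n-grams are an assumed stand-in for 'very frequent words';
    the paper does not specify how VF words are computed.
    """
    counts = Counter(
        doc[i:i + n] for doc in documents for i in range(len(doc) - n + 1)
    )
    ranked = [(g, c) for g, c in counts.most_common() if g not in known]
    return ranked[:top_k]

# Toy corpus (hypothetical, for illustration only).
corpus = ["主權屬於國家", "國家主權", "天氣晴朗"]
seeds = {"主權"}

selected = extract_documents(corpus, seeds)
suggestions = suggest_keywords(selected, seeds, n=2, top_k=1)
print(selected)
print(suggestions)
```

In practice a historian would inspect the suggested strings in context before adding any of them to the keyword list, and the two functions would be called again with the enlarged list.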
With the selected keywords, we could compute their statistical properties. Temporal analysis of keyword frequency is the most fundamental tool. This analysis reveals visual trends in the appearance of a keyword over time. The ups and downs of keyword frequencies may suggest interesting historical events hidden in the text records, and often trigger new ideas for the study.
Figure 1 illustrates a temporal analysis of the keywords related to the constitutional monarchy movement in China between 1905 and 1911. The curves were drawn from statistics collected from the official documents of the central government. The changing trends of the curves indicate the main activities of the central government.
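A temporal analysis of this kind reduces to counting keyword occurrences per time unit. The sketch below assumes that each document carries a publication year and uses a toy data set; the function name, the data layout, and the example keyword are hypothetical and are not taken from the DSMCTL tools.

```python
from collections import defaultdict

def keyword_counts_by_year(documents, keyword):
    """Count occurrences of keyword per year.

    documents: iterable of (year, text) pairs (an assumed data layout).
    Returns a dict mapping year -> number of occurrences, sorted by year.
    """
    counts = defaultdict(int)
    for year, text in documents:
        counts[year] += text.count(keyword)  # non-overlapping matches
    return dict(sorted(counts.items()))

# Toy documents (hypothetical, for illustration only).
docs = [(1906, "預備立憲"), (1905, "立憲立憲"), (1906, "無關")]
series = keyword_counts_by_year(docs, "立憲")
print(series)
```

Plotting such a series for several keywords yields curves like those in a temporal analysis chart; normalizing by the amount of text per year would be a natural refinement when the yearly volume of documents varies.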
In addition, we ran temporal analyses of co-occurrences (commonly referred to as “collocations” in computational linguistics) of keywords. A collocation usually refers to a pair of words, i.e., a bigram, that appears within a selected range of text, e.g., a sentence. Nothing, however, prevented us from analyzing trigrams and more complex contexts. The actual meaning of a word is influenced by its context [3], so the temporal analysis of collocations provides a better opportunity to discover more precise implications of the presence or absence of keywords in the historical documents.
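As an illustration of sentence-level collocation counting by period, the following sketch counts the sentences in which a target word co-occurs with each of a list of keywords. The sentence delimiters, the data layout, and the toy texts are assumptions; only the idea of using a sentence as the co-occurrence range comes from the text above.

```python
import re
from collections import Counter

def collocation_counts(documents, target, keywords, periods):
    """Count sentences where target co-occurs with each keyword, per period.

    documents: iterable of (year, text) pairs (an assumed data layout).
    periods: list of (start_year, end_year) tuples, inclusive.
    Returns {keyword: Counter({"start-end": count})}.
    """
    table = {kw: Counter() for kw in keywords}
    for year, text in documents:
        label = next((f"{a}-{b}" for a, b in periods if a <= year <= b), None)
        if label is None:
            continue  # document falls outside all periods of interest
        # Split on common Chinese sentence-final punctuation (an assumption).
        for sentence in re.split(r"[。！？]", text):
            if target in sentence:
                for kw in keywords:
                    if kw in sentence:
                        table[kw][label] += 1
    return table

# Toy documents (hypothetical, for illustration only).
docs = [(1899, "西人言平等。強權無平等。"), (1905, "平等與權力。西人來。")]
periods = [(1898, 1900), (1901, 1914), (1915, 1924)]
table = collocation_counts(docs, "平等", ["西人", "強權", "權力"], periods)
print(table)
```

Changing the split pattern, or replacing the sentence with a fixed-size character window, changes what counts as a collocation, so the choice of range should be reported along with the statistics.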
Figure 2 shows the changing trends of selected collocations of keywords for the study on the formation of “Chinese People”. The peaks of the curves correspond to historical events that can be found in Wikipedia.
An obvious barrier in conducting the analysis of collocations was the very large number of collocations to be examined. With 100 interesting keywords, for example, a historian might have to examine at most 10000 collocations (bigrams). At present, we deploy software tools to help historians examine and record the original text of these collocations so that they can efficiently select the collocations that attract their attention for further study.
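The upper bound quoted above can be checked by enumeration. The sketch below uses placeholder keyword names (hypothetical) and counts both ordered pairs, which give the bound of 10000, and unordered pairs of distinct keywords, which would apply if the order of the two words is ignored.

```python
from itertools import combinations, product

keywords = [f"kw{i}" for i in range(100)]  # 100 placeholder keywords

# Ordered pairs, including a keyword paired with itself: 100 * 100.
ordered_pairs = sum(1 for _ in product(keywords, repeat=2))

# Unordered pairs of two distinct keywords: 100 * 99 / 2.
unordered_pairs = sum(1 for _ in combinations(keywords, 2))

print(ordered_pairs)    # 10000
print(unordered_pairs)  # 4950
```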
With these supportive software facilities, historians can explore the text material contained in DSMCTL more efficiently and probe into a much larger amount of text than was previously feasible. After carefully identifying important keywords and collocations with the help of the statistical analyses, historians can focus on reading and interpreting the text materials that are really related to the target problem.
Researchers participating in the DSMCTL project have employed these computing tools and procedures to investigate several historical issues. We studied the changing usage of “Sovereignty” (主權) between 1860 and 1928, and looked into the shifting collocations of “ism” (主義) between 1896 and 1928. We examined the historical documents to trace the emerging concepts of “Chinese Labor” (華工), “Chinese Businessman” (華商 /hua2 shang1/), and “Chinese People” (華人) from 1875 to 1909.
A. History of Concepts
More specifically, experience gained in linguistic research shows that “You shall know a word by the company it keeps” [3]. By analyzing the changing collocations of “Equality” (平等), we verified the evolution of the concept of “Equality” in Chinese society over three periods (1898-1900, 1901-1914, and 1915-1924) that was proposed and discussed in [6].
Tables 1 and 2 show the frequencies of keywords that collocated with the word “Equality” in different periods. We can see that, in different periods, different sets of words collocated with “Equality” more often than others, and these different sets of collocations and their original contexts together implied different concepts of “Equality”. At one stage, people sought equality for the nation, when the Qing dynasty was weak and was being invaded by the Western powers. At another stage, people were bothered by the inequality between the public and the private sectors.
Figure 2. A temporal analysis of collocations of keywords for the study on the concept formation of “Chinese People” [7]
Table 1. Frequencies of keywords that collocated with “Equality” (平等)
Collocate   1898-1900   1901-1914   1915-1924
西人        43          10          9
強權        39          14          12
萬國        28          21          5
生滅        23          2           0
Table 2. Frequencies of frequent collocations of “Equality” (平等) for the period between 1901 and 1914 [2]
Collocate   1898-1900   1901-1914   1915-1924
權力        7           121         25
B. Transliterated Words in Historical Documents
We have developed techniques to identify transliterated words in Chinese historical documents [12]. Concepts represented by words like “president” and “democracy” were new to Chinese people, and how people recorded these concepts in Chinese words is important for the study of these concepts in Chinese history. Evidence indicates that Chinese transliterations of these new concepts varied over time, so it is important, though difficult, to find all the different ways of referring to the same concept in Chinese historical documents.
We conducted our study with a special book, 海國圖志 (/hai3 guo2 tu2 zhi4/, HGTZ henceforth), which contains many transliterated words that have already been manually marked by domain experts in China. HGTZ was published in the Qing dynasty (ca. 1841 AD), consists of 100 chapters, and contains about 680 thousand characters.
Since the transliterated words may not be recorded in any lexicon, we have to look for transliterations in raw strings.
After obtaining strings that appear at least twice, we sifted the candidate strings with different filters. The goal was to reduce the number of candidate strings that must be manually checked by domain experts for transliterated words.
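Collecting candidate strings from raw text can be sketched as enumerating character substrings and keeping those that repeat. The length limits and the toy text below are assumptions made for illustration; the default `min_count=2` follows the statement in this section that a string must appear at least twice to be considered a candidate.

```python
from collections import Counter

def candidate_strings(text, min_len=2, max_len=4, min_count=2):
    """Return {substring: count} for substrings that repeat often enough.

    min_len and max_len are hypothetical bounds on candidate length.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}

# Toy text (hypothetical): one transliteration repeated twice.
text = "德模克拉西也德模克拉西"
candidates = candidate_strings(text, min_len=2, max_len=5)
print(candidates)
```

Even this tiny example returns every repeated substring of the transliteration, not just the full word, which hints at why the number of candidates grows quickly and why further filtering is needed.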
As in a traditional information retrieval task, we wish to achieve both high precision and high recall in this process. Removing candidate strings aggressively may save the domain experts a lot of time in manual filtering but may result in poor recall. Keeping many candidate strings for manual inspection may boost recall at the cost of poor precision, and would also make the domain experts spend a lot of time completing the selection.
We have three different types of filters in the current work.
The first one removes strings that appear frequently in non-historical documents, e.g., literary works such as the Dream of the Red Chamber. It is quite unlikely that transliterated words would appear frequently in literary novels.
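A filter of this first type might look like the sketch below, which drops candidates that occur in a reference text more often than a threshold. The threshold, the function name, and the toy reference text are assumptions; the paper does not state how "frequently" was quantified.

```python
def filter_by_reference(candidates, reference_text, max_ref_count=0):
    """Keep candidates that occur at most max_ref_count times in reference_text.

    max_ref_count is a hypothetical cut-off for 'frequent' occurrence.
    """
    return [s for s in candidates if reference_text.count(s) <= max_ref_count]

# Toy data (hypothetical): a short stand-in for a literary reference corpus.
reference = "寶玉笑道黛玉笑道"
kept = filter_by_reference(["德模克拉西", "笑道"], reference)
print(kept)
```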
The second type of filter is to consider the special features of Chinese pronunciation and word formation patterns. The phoneme and lexical patterns of transliterated words may not differ very much from ordinary Chinese words because they will be used in ordinary Chinese texts.
The third type of filter considers the textual contexts of the transliterated words. Since the transliterated words in HGTZ were manually marked, we could extract higher-level linguistic features of the transliterated words, employ machine learning methods to mine rules about the textual contexts in which transliterated words appeared, and then apply the rules to rank the candidate strings.
We ran experiments on a test set of more than 200 thousand candidate strings. Only 57024 of them passed the first and second types of filters, at a recall rate of 76.54%. We then ranked the remaining candidate strings with the machine-learning-based method, and found that 96.14% of the top 500 candidates were indeed transliterated words.
However, a historian may well demand higher recall rates, because unpredictable problems may ensue from the omission of any transliterated word.
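The two measures reported above, recall after filtering and precision among the top-ranked candidates, can be computed as in the following sketch. The function names and the toy data are hypothetical; the gold set corresponds to the manually marked transliterations.

```python
def recall(kept, gold):
    """Fraction of gold transliterations that survive the filters."""
    return len(set(kept) & set(gold)) / len(gold)

def precision_at_k(ranked, gold, k):
    """Fraction of the k top-ranked candidates that are gold transliterations."""
    top = ranked[:k]
    return sum(1 for s in top if s in gold) / len(top)

# Toy data (placeholder strings, for illustration only).
gold = {"t1", "t2", "t3", "t4"}
kept = ["t1", "t2", "t3", "x1", "x2"]     # candidates left after filtering
ranked = ["t1", "x1", "t2", "t3", "x2"]   # the same candidates after ranking

print(recall(kept, gold))                 # 3 of 4 gold items kept
print(precision_at_k(ranked, gold, 2))    # 1 of the top 2 is gold
```

Loosening the filters raises recall but lengthens the list that experts must inspect, which is the trade-off discussed in this section.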
Transliterated words that appear only once in the source text are another problem that we have not yet handled efficiently.
In fact, there is one such instance in HGTZ. At present, a string must appear at least twice to be considered a candidate transliteration. If we considered strings that appear only once, the number of candidate strings would increase dramatically, which would pose big challenges to our data processing capacity.
III. DIFANGZHI (CHINESE LOCAL GAZETTEERS)
Currently, the China Biographical Database Project (CBDB) hosted by Harvard University offers a free download of a database of Chinese biographical information. Enhancing the contents of the CBDB database is an ongoing task, and a good source of additional information is the Difangzhi, a large collection of local gazetteers compiled by local governments across many dynasties in China.
To this end, we have used regular expressions to extract information about individuals, and, at the time of this writing, we have obtained more than 110 thousand records for about 84000 different names. Quite a few of these records share the same name, e.g., we have 29 records for the name 王臣 (/wang2 chen2/). Some records for the same name may belong to the same person, and some do not.
Before we can augment the CBDB database with the information from Difangzhi, we have to determine whether the owners of Difangzhi records with the same name are the same person or different persons. We call this task name disambiguation, and we are implementing an algorithm laid out by Bol [1]. This algorithm considers many factors in a name record, including birth place, entry into office (入仕方法 /ru4 shi4 fang1 fa3/), office posting (職官 /zhi2 guan1/), alternate names (字號 /zi4 hao4/), service location (任職地點 /ren4 zhi2 di4 dian3/), service periods (任職時間 /ren4 zhi2 shi2 jian1/), etc., and we compare these factoids of two name records to compute a score for their similarity.
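To make the idea of a factor-based similarity score concrete, the sketch below compares two name records field by field. It is not the algorithm of [1]: the field names, the weights, and the rule of adding a weight for agreement and subtracting it for disagreement are all assumptions chosen for illustration.

```python
# Hypothetical weights; the actual factors and their importance follow [1].
WEIGHTS = {
    "birth_place": 2.0,
    "entry_method": 1.0,
    "office": 1.0,
    "alt_name": 3.0,
    "service_place": 1.0,
}

def similarity(rec_a, rec_b, weights=WEIGHTS):
    """Score two name records: add weight on agreement, subtract on conflict.

    Fields missing from either record are skipped, so absent evidence
    neither raises nor lowers the score.
    """
    score = 0.0
    for field, w in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue
        score += w if a == b else -w
    return score

# Toy records sharing one name (hypothetical values).
r1 = {"birth_place": "P1", "alt_name": "A1", "office": "O1"}
r2 = {"birth_place": "P1", "alt_name": "A1", "service_place": "S1"}
r3 = {"birth_place": "P2", "alt_name": "A2", "office": "O1"}

print(similarity(r1, r2))  # agreement on birth place and alternate name
print(similarity(r1, r3))  # conflicts outweigh the shared office title
```

Pairs with high scores would be proposed to domain experts as likely the same person, and pairs with low scores as likely different persons.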
Temporal and spatial information are two important categories of information for the task of name disambiguation.
The information about service periods in two records, for example, may help us differentiate two persons with the same name when the service periods were far apart.
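The use of service periods can be sketched as a simple gap test between two year ranges. The representation of a period as a (start, end) pair and the cut-off of 60 years are hypothetical choices, not values from this study.

```python
def period_gap(a, b):
    """Years separating two (start, end) service periods; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))

def likely_different(a, b, max_gap=60):
    """Flag two records as likely different persons when the gap is large.

    max_gap is a hypothetical threshold, roughly a working lifetime.
    """
    return period_gap(a, b) > max_gap

print(period_gap((1400, 1410), (1405, 1420)))        # overlapping periods
print(likely_different((1400, 1410), (1600, 1610)))  # almost two centuries apart
```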
Spatial information includes the birth places, the service locations, and the publication addresses of the Difangzhi books.
If two name records have the same values for these items, they