A Generative Data Augmentation Model for Enhancing Chinese Dialect Pronunciation Prediction

Chu-Cheng Lin and Richard Tzong-Han Tsai

Abstract—Most spoken Chinese dialects lack comprehensive digital pronunciation databases, which are crucial for speech processing tasks. Given complete pronunciation databases for related dialects, one can use supervised learning techniques to predict a Chinese character’s pronunciation in a target dialect based on the character’s features and its pronunciation in other related dialects. Unfortunately, Chinese dialect pronunciation databases are far from complete. We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist in multiple dialects. The proposed model can augment missing dialectal pronunciations based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. The augmented pronunciation database can then be used in supervised learning settings. We evaluate the prediction accuracy in terms of phonological features, such as tone, initial phoneme, and final phoneme. We also evaluate all features of a character as a whole, a measure we call overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves the OPFA of a support vector machine (SVM) model.

In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The experimental results show that using features from closely related dialects results in higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model’s OPFA by up to 7.6%.

Index Terms—Chinese dialects, data augmentation, generative model, pronunciation database.

I. INTRODUCTION

Character pronunciation databases are key resources in speech processing tasks such as speech recognition and synthesis. For official written languages, such databases are rich. For example, English has the CMU pronouncing dictionary [1], while Mandarin has the Unihan database [2]. For spoken languages, however, digitized pronunciation resources are not so plentiful. In China, this is particularly relevant. A 2004 survey of Chinese dialects revealed that more than 86% of the Chinese population can converse in a non-Mandarin dialect, while only 53% can converse in Mandarin [3]. However, there is a serious lack of such databases for non-Mandarin dialects.

Manuscript received October 31, 2010; revised March 14, 2011; accepted July 11, 2011. Date of publication October 17, 2011; date of current version February 10, 2012. This work was supported in part by the National Science Council under Grants NSC 98-2221-E-155-060-MY3 and NSC99-2628-E-155-004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gokhan Tur.

C.-C. Lin is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan (e-mail: chu.cheng.[email protected]).

R. T.-H. Tsai is with the Department of Computer Science and Engineering, Yuan Ze University, Zhongli 320, Taiwan (e-mail: [email protected]).

Digital Object Identifier 10.1109/TASL.2011.2172424

This situation impedes the development of speech processing technologies and applications for resource-poor dialects. Since compiling such resources is labor-intensive, our goal is to develop a tool to help automate the prediction of character pronunciations for different Chinese dialects.

Currently, most dialect pronunciation databases/dictionaries have been constructed by individual researchers and vary greatly in terms of completeness. If we have complete pronunciation databases for related dialects, we can use standard supervised learning techniques to predict a character’s pronunciation in a target dialect. As mentioned above, however, pronunciation databases for most Chinese dialects are far from complete. Therefore, we propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist in multiple dialects. Unlike previous work, this model does not assume that language evolves like a branching tree, but only that character pronunciations across related dialects show patterns. The proposed model can augment character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in medieval rime books.

After augmentation, a standard classifier-based pronunciation prediction system can be constructed.

II. BACKGROUND OF CHINESE DIALECTS

A. Mutual Intelligibility

It is widely recognized that Chinese dialects are to a great extent mutually unintelligible. All the southern Chinese dialects have mean sentence intelligibility lower than 30% for nonnative speakers [4]. In comparison, Portuguese and Spanish have mutual intelligibility at roughly 60% [5].

Although the mutual intelligibility among Chinese dialects is very low, the character pronunciations across dialects show regular correspondence. For example, the pronunciations of “肝” (gan/liver) and “寒” (han/frigid) sound utterly different in Southern Min and Mandarin; but within the dialects themselves, the rhyming is consistent.

B. Rime Books

Apart from areal influence, this striking correspondence is largely attributed to historical reasons [6], which can be seen in medieval rime books. Earlier rime books, such as “切韻 (Qieyun)” (601 AD), record contemporary character pronunciations with fanqie (“反切”) analyses. Fanqie represents a character’s pronunciation with two other characters, combining the former’s onset with the latter’s rhyme and tone. An English equivalent would be to combine the onset of “peek” /piːk/ and the rhyme of “cat” /kæt/ to get “pat” /pæt/.

TABLE I

SYMBOLS USED IN SECTION IV

Obviously, there may be multiple combinations of characters to represent a single pronunciation in the fanqie system. In contrast, later rime books, such as “韻鏡 (Yunjing)” (900–950 AD), offered finer phonological analysis, using fixed sets of characters to represent phonological qualities of contemporary analysis [6]. A character pronunciation under the new system has six features, each taking its value from a fixed set of Chinese characters. The six features are 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades). For example, the character 含 has 匣 (xia) as 聲母, 咸 as 攝, etc. These features cannot be directly employed to reconstruct Middle Chinese pronunciations, as the meaning of some features is still disputed. Nevertheless, modern dialects still bear the correspondence, and thus rime book features can be used to infer phonological correspondence in modern dialects between characters that share a rime book feature. For example, the two characters “含” (han) and “站” (zhan) are described with the same rhyme group character “咸” (xian), and they still rhyme in Mandarin, Cantonese, and Amoy, although the pronunciations do not rhyme across dialects. Thus, the rime books are very valuable resources in determining a character’s pronunciation.
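The fanqie combination rule described above can be sketched in a few lines. This is only an illustration of the rule, not part of the proposed model: the `Syllable` type and the `fanqie` function are hypothetical names, and the English example stands in for real rime book data.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    onset: str  # initial consonant(s)
    rhyme: str  # nucleus plus coda
    tone: str   # tone label; left empty in the English illustration


def fanqie(upper: Syllable, lower: Syllable) -> Syllable:
    """Take the onset of the first speller and the rhyme and tone of the second."""
    return Syllable(onset=upper.onset, rhyme=lower.rhyme, tone=lower.tone)


# English analogy from the text: "peek" + "cat" gives "pat".
peek = Syllable("p", "iːk", "")
cat = Syllable("k", "æt", "")
pat = fanqie(peek, cat)
print(pat.onset + pat.rhyme)
```

Because many different speller pairs share the same onset or the same rhyme, several pairs can spell one pronunciation, which is the ambiguity that later rime books avoided by using fixed category characters.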

III. RELATED WORK

There are many modern dictionaries using phonetic alphabets to denote pronunciation for specific dialects, such as 粤音韻彙 (A Chinese Syllabary Pronounced According to the Dialect of Canton). In 1962, the first comprehensive cross-dialectal lexicon, 漢語方音字彙 (Hanyu Fangyin Zihui, Zihui), was published. The original Zihui consists of approximately 2500 character readings with IPA notation from 17 modern Chinese dialects. In addition, the categorical descriptive features from the Middle Chinese rime book 韻鏡 (Yunjing) are also provided.

Soon after its publication, Zihui was digitized under Project DOC (Dictionary on Computer) [7]. The Zihui lexicon is invaluable to the study of diachronic phonology. However, many dialects are still unrecorded. Another problem is that Zihui only contains about 2500 characters, far fewer than the total number of Chinese characters (more than 50 000). These two flaws render the Zihui lexicon unsatisfactory when used as a dialect dictionary. Our work therefore proposes to augment the unseen characters and languages using the dialects and character readings recorded in the Zihui lexicon.

Augmenting missing data with known information is not a new idea, as practiced by [8] and [9]. Data augmentation is generally done by introducing latent variables to model the training data [10]. In our problem, we need to model dialectal pronunciation data. A model of pronunciations has been proposed for the Romance languages by [11], which allows generation of word forms of both reconstructed languages and modern languages. A phylogenic tree of Classical Latin, Vulgar Latin, Spanish, and Italian was built to model the evolutionary relationship among these languages. In this tree, Classical Latin is the root, Vulgar Latin is its child, and Spanish and Italian are Vulgar Latin’s descendants. In their approach, the pronunciation of the root language must be given.

However, for Chinese dialects, the applicability of the tree model is disputed. It has been suggested that it may be more appropriate to model the development of Chinese dialects with a network [12]. Even if we place Chinese dialects into a tree structure following Bouchard-Côté et al.’s model and set Middle Chinese, which influenced the largest number of Chinese dialects, as the root language, we still encounter the following problem. Classical Latin’s phonology has been well established [13]. Therefore, the actual pronunciation can be easily deduced from the spelling. Unlike Classical Latin, the phonology and character pronunciations of Middle Chinese are still not wholly clear. For example, we know virtually nothing about the actual tone. Current reconstructions depend heavily upon medieval rime books, which are known to be a combination of at least two Middle Chinese dialects [14]. To derive a proper phylogenic tree, one must first distinguish between the Middle Chinese dialects (at least two according to Ting) and then correctly assign their respective offspring languages. However, current studies show that for certain Wu dialects there are at least two substrata, one from the northern Middle Chinese and the other from the southern one [15]. This directly violates the tree assumption. For a given language, without the actual pronunciation in its ancestral language, we cannot use Bouchard-Côté et al.’s model to predict a character’s pronunciation in that language.

Some studies use the resources of other languages to deal with resource-poor languages. Adding unannotated text in more languages has been shown to improve unsupervised POS tagging performance [16]. Multilingual acoustic data has also been used to improve recognition performance for a newly seen language by sharing articulatory feature data among languages [17]. These studies assume that the linguistic data used during training has patterns that carry over to the newly seen language, whereas our work only assumes that Chinese dialects have consistent phonological correspondence with Middle Chinese and among themselves.

IV. METHODOLOGY

A. Problem Definition

Our task is to augment the pronunciation database of Chinese dialects. For each record, the given pronunciation database lists all existing pronunciations in the 21 dialects from all major dialect groups. That is, some records may be incomplete. Our augmentation model utilizes not only the existing pronunciations, represented by phonemes (which we will refer to as phonological features), but also rime book features.

More formally, let $c$ be the character in a record. Let its categorical rime book features be $\mathbf{r}_c = (r_{c,1}, \dots, r_{c,6})$. For example, the rime book features of the character 含 (han) can be encoded as [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)]. The multi-class vector $\mathbf{r}_c$ is then converted to a binary vector $\mathbf{b}_c$ by concatenating each “flattened” component of $\mathbf{r}_c$. For example, a component with three possible values is “flattened” to a binary vector of dimension 3. Since there are six rime book components in $\mathbf{r}_c$, $\mathbf{b}_c$ is a binary vector of length $\sum_{j=1}^{6} |R_j|$, where $R_j$ is the set of possible values of the $j$th component. Let there be $D$ modern dialects $d_1, \dots, d_D$. Each dialect has a fixed number of phonological features.
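The “flattening” step is ordinary one-hot encoding applied to each of the six components. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the value inventory is built from the records at hand rather than from the full rime book, and the second record is a made-up placeholder (only its rhyme group 咸 and openness 開 are chosen to match the first record).

```python
from typing import Dict, List, Sequence


def build_vocabs(records: Sequence[Sequence[str]]) -> List[Dict[str, int]]:
    """Build one value-to-index map per component, in order of first appearance."""
    vocabs: List[Dict[str, int]] = [dict() for _ in records[0]]
    for record in records:
        for j, value in enumerate(record):
            vocabs[j].setdefault(value, len(vocabs[j]))
    return vocabs


def flatten(record: Sequence[str], vocabs: List[Dict[str, int]]) -> List[int]:
    """Concatenate the one-hot encoding of every component into one binary vector."""
    binary: List[int] = []
    for value, vocab in zip(record, vocabs):
        block = [0] * len(vocab)
        block[vocab[value]] = 1
        binary.extend(block)
    return binary


han = ["匣", "覃", "咸", "平", "開", "一"]  # rime book features of 含, from the text
toy = ["init_B", "rhyme_B", "咸", "tone_B", "開", "grade_B"]  # hypothetical placeholder record
vocabs = build_vocabs([han, toy])
print(flatten(han, vocabs))
print(flatten(toy, vocabs))
```

Each binary vector contains exactly six ones, one per component, and its length is the sum of the component inventory sizes.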

TABLE II

ENCODED PHONOLOGICAL FEATURES OF THE DOC DATASET

Fig. 1. Scheme of the input data. There are $N$ characters, each of which has a known binary rime book feature vector $\mathbf{b}_c$. Some of the phonological features may be missing. Our goal is to fill in the missing values, and the output is a complete table.

Take the character 含 as an example. Its rime book feature vector is [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)], and its phonological features (see Table II) for the Xiamen dialect would be [“12”, “43”, /h/, , /a/, , false, /m/].

The problem can be stated as follows: suppose there are $M$ phonological features in total across all dialects. Given binary rime book feature vectors $\mathbf{b}_{c_1}, \dots, \mathbf{b}_{c_N}$ and a partially filled phonological feature table of dimension $N$ by $M$ for characters $c_1, \dots, c_N$, our goal is to fill in that table. Fig. 1 depicts the scheme of the input under the problem definition.

Definitions of symbols introduced in this section can be found in Table I.

B. Model Considerations

As described in Section II, nearly every Chinese dialect’s phonology is highly correlated to both the categorical features described in 廣韻 (Guangyun) and 韻鏡 (Yunjing), and to other Chinese dialects’ phonological features. For example, there is a clear correspondence among the rime book feature 深攝 (shen-she), the Cantonese rhyme /am/, and the Xiamen rhyme /im/. While the rime book alone offers much insight into many dialects’ phonology, some characters listed under different rime-book rhymes have clear correspondence among dialects. To augment missing phonological features, all the above phenomena should be taken into consideration.

We propose a model that simultaneously captures phonological similarities across dialects and rime book features, using latent variables which we call superlingual rhymes (SLRs). Our model splits each character’s record into two parts. The first part contains its rime book features, while the second consists of its phonological features. Our task is to augment missing values in the second part. We know that rime book features are highly correlated with phonological features in every Chinese dialect. Therefore, we employ rime book features to estimate missing phonological features. In addition, our model also employs the other dialects’ phonological features. Our basic idea is to introduce superlingual rhymes as an intermediate layer between rime book features and all dialects’ phonological features. The pronunciation of each character can be represented as a mixture of all superlingual rhymes. That is, for each superlingual rhyme, the character has a proportional value. Since the phonological features are all categorical data, they are naturally modeled with multinomial distributions. As in every Bayesian model, we impose priors on these multinomials. Following many previous works such as [18] and [19], we chose the Dirichlet distribution, which allows an analytic expression of the posterior probability. The proportional values of a character follow a Dirichlet distribution, whose parameters are decided by log-linear functions of the character’s rime book features. This approach is also known as logistic regression. Because of the conjugacy between the Dirichlet and multinomial distributions, we can obtain the posterior distribution of a character over SLRs easily [20], [21].

Fig. 2. Plate diagram of our proposed generative model. Shaded nodes are observed data.

Mixing a generative model with logistic regression is akin to the paradigm advocated by [22]. Similarly, using the multinomial-Dirichlet conjugacy, we can estimate the distribution of a superlingual rhyme over phonological features. Then, because each character’s proportion of each superlingual rhyme and each superlingual rhyme’s proportion of each phonological feature are known, missing phonological features can be augmented.

C. Model Description

A plate diagram for our proposed model is shown in Fig. 2.

Let an observation $o$ be a tuple of two components: $o = (f, d)$, where $f$ is an observed phonological feature value and $d$ is the dialect of $f$. For every observation $o_i$ of character $c$, there is a latent SLR $z_i$; the character is thus a mixture of SLRs. To simplify the explanation, we assume every dialect has only one phonological feature, namely $f$. In the real model, each observation has multiple phonological features for dialect $d$, but the model’s structure is roughly the same.

We describe the model as follows. Let there be $K$ SLRs: $s_1, \dots, s_K$. Each $s_k$ has $D$ multinomial distributions $\phi_{k,1}, \dots, \phi_{k,D}$ over phonological feature values, one per dialect, and a multinomial distribution $\psi_k$ over the dialects $d_1, \dots, d_D$. The $\phi$’s and $\psi$’s are given uniform Dirichlet priors Dirichlet($\beta$) and Dirichlet($\gamma$). In our experiments, each component of both $\beta$ and $\gamma$ is set to 0.001, making the prior rather sparse.

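Recall that the binary rime book feature vector of character $c$ is $\mathbf{b}_c$. Let there be $K$ rime book feature weight vectors $\lambda_1, \dots, \lambda_K$, one for each SLR, and each $\lambda_k$ has the same dimension as $\mathbf{b}_c$. We then define the distribution over all SLRs in character $c$ to be a multinomial distribution $\theta_c$ with prior Dirichlet($\alpha_c$) = Dirichlet($\exp(\lambda_1^{\top}\mathbf{b}_c), \dots, \exp(\lambda_K^{\top}\mathbf{b}_c)$). Note that $E[\theta_{c,k}] = \alpha_{c,k} / \sum_{k'} \alpha_{c,k'}$. In other words, the prior probability of SLR $s_k$ is proportional to $\exp(\lambda_k^{\top}\mathbf{b}_c)$, a log-linear function. $\lambda$ is treated as a given value in the generating part; it is in fact given a Normal prior, but we do not change its value through MCMC steps. Rather, its value is obtained by maximizing the likelihood of the generative model. We go into details in Section IV-D.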

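We now describe the generating process. A plate diagram for this model is depicted in Fig. 2.

1) For each SLR $s_k$:

a) $\lambda_k \sim \mathcal{N}(0, \sigma^2 I)$;

b) $\psi_k \sim$ Dirichlet($\gamma$);

c) for each dialect $d$, $\phi_{k,d} \sim$ Dirichlet($\beta$).

2) For each character $c$ and its binary rime book feature vector $\mathbf{b}_c$:

a) for each SLR $s_k$, $\alpha_{c,k} = \exp(\lambda_k^{\top}\mathbf{b}_c)$;

b) $\theta_c \sim$ Dirichlet($\alpha_c$);

c) for each observation $o_i = (f_i, d_i)$:

• $z_i \sim$ Multinomial($\theta_c$);

• $d_i \sim$ Multinomial($\psi_{z_i}$), $f_i \sim$ Multinomial($\phi_{z_i, d_i}$).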
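The generating process can be sketched as forward sampling. This is a toy illustration under several assumptions, not the authors’ implementation: it uses the simplified one-feature-per-dialect model, integer-coded dialects and feature values, a zero-mean Normal prior on the weights with standard deviation `sigma`, and priors of 0.1 rather than 0.001 so that the standard-library Gamma sampler does not underflow. All function and parameter names are hypothetical.

```python
import math
import random


def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:  # very sparse priors can underflow; fall back to a point mass
        hot = rng.randrange(len(alpha))
        return [1.0 if i == hot else 0.0 for i in range(len(alpha))]
    return [x / total for x in draws]


def categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(b_vectors, n_obs, K, D, V, beta=0.1, gamma=0.1, sigma=1.0, seed=0):
    """Sample (SLR, dialect, feature value) triples for every character."""
    rng = random.Random(seed)
    dim = len(b_vectors[0])
    # Step 1: per-SLR weight vectors, dialect distributions, feature-value distributions.
    lam = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(K)]
    psi = [dirichlet([gamma] * D, rng) for _ in range(K)]
    phi = [[dirichlet([beta] * V, rng) for _ in range(D)] for _ in range(K)]
    data = []
    for b in b_vectors:
        # Step 2a/2b: log-linear Dirichlet parameters, then the SLR mixture of the character.
        alpha = [math.exp(sum(w * x for w, x in zip(lam[k], b))) for k in range(K)]
        theta = dirichlet(alpha, rng)
        observations = []
        for _ in range(n_obs):
            # Step 2c: latent SLR, then dialect, then feature value.
            z = categorical(theta, rng)
            d = categorical(psi[z], rng)
            f = categorical(phi[z][d], rng)
            observations.append((z, d, f))
        data.append(observations)
    return data


sample = generate([[1, 0, 1], [0, 1, 1]], n_obs=5, K=3, D=2, V=4)
print(sample[0])
```

In the actual task the dialect and feature value of each observation come from the pronunciation table, and only the SLR assignments are latent; forward sampling is shown here only to make the dependency structure concrete.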

D. Inference

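Without subscripts, the full joint distribution, expressed as a product of distributions, is

$$P(f, d, z, \theta, \phi, \psi \mid \lambda, \beta, \gamma) = P(\theta \mid \lambda)\, P(\phi \mid \beta)\, P(\psi \mid \gamma) \tag{1}$$

$$\qquad\qquad \times\, P(z \mid \theta)\, P(d \mid z, \psi)\, P(f \mid z, d, \phi). \tag{2}$$

This equation can be rearranged and simplified by grouping each prior with the multinomial terms that depend on it. Variables $\theta$, $\phi$, and $\psi$ can then be integrated out using the identity

$$\int \prod_{i} x_i^{a_i - 1}\, d\mathbf{x} = \frac{\prod_{i} \Gamma(a_i)}{\Gamma\left(\sum_{i} a_i\right)}$$

where $x_i \geq 0$, $\sum_i x_i = 1$, and $a_i > 0$. More details are available in Appendix A.

We then have

$$P(f, d, z \mid \lambda, \beta, \gamma) \propto \prod_{c} \left[ \frac{\Gamma\left(\sum_k \alpha_{c,k}\right)}{\Gamma\left(\sum_k (\alpha_{c,k} + n_{c,k})\right)} \prod_{k} \frac{\Gamma(\alpha_{c,k} + n_{c,k})}{\Gamma(\alpha_{c,k})} \right] \prod_{k} \frac{\prod_{d} \Gamma(\gamma + n_{k,d})}{\Gamma\left(\sum_{d} (\gamma + n_{k,d})\right)} \prod_{k} \prod_{d} \frac{\prod_{v} \Gamma(\beta + n_{k,d,v})}{\Gamma\left(\sum_{v} (\beta + n_{k,d,v})\right)} \tag{3}$$

where $n_{k,d}$ is the number of observations that have dialect $d$ with SLR $s_k$, $n_{k,d,v}$ is the number of observations that have phonological feature value $v$ with SLR $s_k$ and dialect $d$, and $n_{c,k}$ is the number of observations with SLR $s_k$ in character $c$.

In (3) we have four variables, $z$, $f$, $d$, and $\lambda$, and cannot really sample from $P(f, d, z \mid \lambda)$ directly. However, it can be shown that there exists an efficient Gibbs sampler to infer $z$; we subsequently use optimization methods to compute $\lambda$.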

1) The Gibbs Sampler: Gibbs sampling is an MCMC technique for sampling from a complex multivariate distribution. It can be applied if, given variables $x_1, \dots, x_n$, sampling from $P(x_1, \dots, x_n)$ is infeasible, but sampling from the conditional distributions $P(x_i \mid x_{-i})$ is feasible. Below is the Gibbs sampler:

1) randomly assign values to $x_1, \dots, x_n$;

2) for $t = 1$ to an arbitrarily assigned $T$:

• for $i = 1$ to $n$:

a) re-sample a new value of $x_i$ from $P(x_i \mid x_{-i})$.

The variable $x_{-i}$ denotes $x_1, \dots, x_n$ except $x_i$. If $T$ is sufficiently large, the resultant values $x_1, \dots, x_n$ can be regarded as a sample from $P(x_1, \dots, x_n)$.

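Since the training data already provides us with $f$ and $d$, we do not resample them. Neither do we resample $\lambda$; instead we use optimization methods to find the most probable $\lambda$. We therefore only need to collect samples from $P(z \mid f, d, \lambda)$.

Since $z$ is a vector of variables consisting of all observed values’ (unobserved) SLRs, $P(z \mid f, d, \lambda)$ is actually multivariate. We use the Gibbs sampling technique here and obtain samples of $z$ by alternately sampling from $P(z_i \mid z_{-i}, f, d, \lambda)$, which can be expressed as

$$P(z_i = k \mid z_{-i}, f, d, \lambda) = \frac{P(z_i = k, z_{-i}, f, d \mid \lambda)}{\sum_{k'} P(z_i = k', z_{-i}, f, d \mid \lambda)} \tag{4}$$

where $z_{-i}$ denotes all components of $z$ except $z_i$. After reorganization (the details are in Appendix B), we have

$$P(z_i = k \mid z_{-i}, f, d, \lambda) \propto (\alpha_{c,k} + n_{c,k} - \delta_k)\, \frac{\gamma + n_{k,d} - \delta_k}{D\gamma + n_{k} - \delta_k}\, \frac{\beta + n_{k,d,v} - \delta_k}{V\beta + n_{k,d} - \delta_k} \tag{5}$$

where observation $i$ belongs to character $c$ and has dialect $d$ and feature value $v$, $n_k = \sum_{d} n_{k,d}$, $V$ is the number of possible feature values, and $\delta_k = 1$ if the current assignment of $z_i$ is $k$, and $\delta_k = 0$ otherwise.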

Now we describe the Gibbs sampler for $z$:

1) randomly assign values to $z$;

2) for $t = 1$ to an arbitrarily assigned $T$:

• for $i = 1$ to $|z|$:

a) re-sample a new value of $z_i$ using (5).
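One sweep of a collapsed Gibbs sampler of this kind can be sketched as follows. The sketch assumes the simplified model with one feature per dialect, scalar symmetric priors `beta` and `gamma`, and observations given as `(character, dialect, value)` integer triples; the per-term weights are the standard Dirichlet-multinomial predictive probabilities and may differ in form from the paper’s equation (5). The function names are hypothetical.

```python
import random
from collections import Counter


def count_tables(obs, z):
    """Count co-occurrences of SLR assignments with characters, dialects, and values."""
    n_ck, n_kd, n_kdv, n_k = Counter(), Counter(), Counter(), Counter()
    for (c, d, v), k in zip(obs, z):
        n_ck[c, k] += 1
        n_kd[k, d] += 1
        n_kdv[k, d, v] += 1
        n_k[k] += 1
    return n_ck, n_kd, n_kdv, n_k


def shift(counts, c, d, v, k, delta):
    """Add or remove one observation with SLR k from all count tables."""
    n_ck, n_kd, n_kdv, n_k = counts
    n_ck[c, k] += delta
    n_kd[k, d] += delta
    n_kdv[k, d, v] += delta
    n_k[k] += delta


def conditional_weights(c, d, v, counts, alpha, beta, gamma, K, D, V):
    """Unnormalized probability of each SLR; counts must exclude the current observation."""
    n_ck, n_kd, n_kdv, n_k = counts
    return [
        (alpha[c][k] + n_ck[c, k])
        * (gamma + n_kd[k, d]) / (D * gamma + n_k[k])
        * (beta + n_kdv[k, d, v]) / (V * beta + n_kd[k, d])
        for k in range(K)
    ]


def gibbs_sweep(obs, z, alpha, beta, gamma, K, D, V, rng):
    """Resample the SLR of every observation once, in place."""
    counts = count_tables(obs, z)
    for i, (c, d, v) in enumerate(obs):
        shift(counts, c, d, v, z[i], -1)  # exclude observation i
        weights = conditional_weights(c, d, v, counts, alpha, beta, gamma, K, D, V)
        z[i] = rng.choices(range(K), weights=weights, k=1)[0]
        shift(counts, c, d, v, z[i], +1)  # re-insert with the new assignment
    return z


rng = random.Random(0)
obs = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 2)]
alpha = [[1.0, 1.0], [1.0, 1.0]]  # alpha[c][k], here fixed instead of log-linear
z = gibbs_sweep(obs, [0, 1, 0, 1], alpha, beta=0.5, gamma=0.5, K=2, D=2, V=3, rng=rng)
print(z)
```

Removing the current observation from the counts before computing the weights plays the same role as subtracting the indicator term in (5).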

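2) Computing $\lambda$: Unlike $z$, we do not use MCMC techniques to find $\lambda$ because it is difficult to derive a Gibbs sampler for $\lambda$. For our purpose, however, an MAP estimate of $\lambda$ suffices.

We use L-BFGS to solve this numeric optimization problem. L-BFGS requires the loss function and its gradient for minimization [23]. First, from (3) we can derive the loss function $\mathcal{L}(\lambda)$, which is the negative log-likelihood of $\lambda$ together with its Normal prior term:

$$\mathcal{L}(\lambda) = -\sum_{c} \left[ \log\Gamma\left(\sum_k \alpha_{c,k}\right) - \log\Gamma\left(\sum_k (\alpha_{c,k} + n_{c,k})\right) + \sum_{k} \left( \log\Gamma(\alpha_{c,k} + n_{c,k}) - \log\Gamma(\alpha_{c,k}) \right) \right] + \sum_{k} \frac{\lVert \lambda_k \rVert^2}{2\sigma^2} + C \tag{6}$$

where $C$ is a constant and $\sigma^2$ is the variance of the Normal prior; recall that for character $c$, $\alpha_{c,k} = \exp(\lambda_k^{\top}\mathbf{b}_c)$.

Likewise, we derive the gradient $\nabla\mathcal{L}$:

$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = -\sum_{c} \mathbf{b}_c\, \alpha_{c,k} \left[ \Psi\left(\sum_{k'} \alpha_{c,k'}\right) - \Psi\left(\sum_{k'} (\alpha_{c,k'} + n_{c,k'})\right) + \Psi(\alpha_{c,k} + n_{c,k}) - \Psi(\alpha_{c,k}) \right] + \frac{\lambda_k}{\sigma^2}$$

where $\Psi$ is the digamma function.

As previously stated, we can minimize $\mathcal{L}$ if we can compute both $\mathcal{L}$ and $\nabla\mathcal{L}$; minimizing $\mathcal{L}$ in turn maximizes the likelihood $P(f, d, z \mid \lambda)$ weighted by the prior on $\lambda$.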
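A loss and gradient of this kind can be sketched in plain Python. The sketch makes several assumptions that the text does not confirm: a zero-mean Normal prior with variance `sigma2` on every weight, SLR counts `n[c][k]` taken from the current Gibbs state, and a hand-written digamma approximation, since the standard library offers `math.lgamma` but no digamma. The function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma for x > 0 via the recurrence relation plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def loss_and_grad(lam, b_vectors, n, sigma2=1.0):
    """Return the penalized negative log-likelihood of lam and its gradient.

    lam[k][j] are SLR weights, b_vectors[c][j] are binary rime book features,
    and n[c][k] is the number of observations of character c assigned to SLR k.
    """
    K, J = len(lam), len(lam[0])
    loss = sum(w * w for row in lam for w in row) / (2.0 * sigma2)
    grad = [[w / sigma2 for w in row] for row in lam]
    for b, n_c in zip(b_vectors, n):
        a = [math.exp(sum(w * x for w, x in zip(lam[k], b))) for k in range(K)]
        a_sum, n_sum = sum(a), sum(n_c)
        loss -= math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
        common = digamma(a_sum) - digamma(a_sum + n_sum)
        for k in range(K):
            loss -= math.lgamma(a[k] + n_c[k]) - math.lgamma(a[k])
            coeff = a[k] * (common + digamma(a[k] + n_c[k]) - digamma(a[k]))
            for j in range(J):
                grad[k][j] -= coeff * b[j]
    return loss, grad


lam0 = [[0.1, -0.2], [0.3, 0.0]]
b_toy = [[1, 0], [1, 1]]
n_toy = [[2, 1], [0, 3]]
loss0, grad0 = loss_and_grad(lam0, b_toy, n_toy)
print(loss0, grad0)
```

A pair like this can be handed to any L-BFGS routine after flattening the weights into one vector; for instance, `scipy.optimize.minimize` accepts `method="L-BFGS-B"` with `jac=True` when the objective returns both values.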


E. Inference Procedure

In Section IV-D, we described a Gibbs sampler that samples from the posterior $P(z \mid f, d, \lambda)$, and in Section IV-D2 we derived $\mathcal{L}$ and $\nabla\mathcal{L}$, which enable us to maximize the likelihood of $\lambda$. We use an EM-like algorithm for inference [24]: in alternating steps we sample $z$ and maximize over $\lambda$, repeatedly. The posterior feature-value distribution can be sampled if the latent SLRs are fixed. To augment a missing phonological feature, we output the mode of its samples over several iterations.
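The final augmentation step can be sketched as taking the most frequent sampled value. This is a minimal illustration, assuming the values sampled for one missing field over several iterations have already been collected in a list; the function name `augment_with_mode` and the sample values are hypothetical.

```python
from collections import Counter


def augment_with_mode(samples):
    """Return the most frequent sampled value; ties go to the value seen first."""
    return Counter(samples).most_common(1)[0][0]


# Hypothetical samples of a missing final consonant across five iterations.
print(augment_with_mode(["/m/", "/n/", "/m/", "/m/", "/n/"]))
```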

V. DATA AND EVALUATION METRICS

A. Data

The experiments are conducted on the DOC dataset described in Section III. In this dataset, each record corresponds to one pronunciation of a Chinese character. For example, the polyphone “正”, with two Mandarin pronunciations (zheng1 and zheng4), has two corresponding records. The number of pronunciations for a character is determined by Guangyun. For each record, the DOC dataset lists all existing pronunciations in 21 dialects from all major dialect groups. In the original DOC, pronunciations are transcribed in IPA notation. [25] represented these IPA transcriptions with eight phonological features, listed in Table II. Given that there are 21 dialects and eight features, each record contains a total of 168 phonological features. Some records are incomplete because certain phonological features do not exist in some dialects. After disambiguation of polyphone characters, we have 5403 records.

B. Evaluation Metrics

Individual pronunciation feature accuracy (IPFA) is measured as the number of correctly predicted phonological features over the number of phonological features in the test set. Overall pronunciation feature accuracy (OPFA) is measured as the number of correctly predicted records over the number of records in the test set.
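The two metrics can be sketched directly from their definitions. The sketch assumes each record is a list of feature labels for the target dialect and that a record counts as correct only when every one of its features is predicted correctly; the function names and the toy labels are hypothetical.

```python
def ipfa(predicted, gold):
    """Fraction of individual phonological features predicted correctly."""
    total = sum(len(record) for record in gold)
    correct = sum(
        p == g
        for pred_record, gold_record in zip(predicted, gold)
        for p, g in zip(pred_record, gold_record)
    )
    return correct / total


def opfa(predicted, gold):
    """Fraction of records whose features are all predicted correctly."""
    correct = sum(list(p) == list(g) for p, g in zip(predicted, gold))
    return correct / len(gold)


# Two toy records with four features each; one feature of the second record is wrong.
gold = [["12", "h", "a", "m"], ["43", "k", "a", "n"]]
pred = [["12", "h", "a", "m"], ["43", "k", "a", "m"]]
print(ipfa(pred, gold), opfa(pred, gold))
```

A single wrong feature lowers IPFA only slightly but costs the whole record under OPFA, so OPFA is the stricter measure.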

C. Evaluation Scheme

To evaluate prediction accuracy in a given target dialect $d_t$, all phonological features of $d_t$ are regarded as ground truth labels. Some phonological features of dialects other than $d_t$ may be missing, and they are all filled in using either our proposed model or a baseline classifier, depending on which augmentation method is used in that configuration.

Since one of our focuses is augmentation (see Section VI-C), in the augmentation experiments we randomly remove phonological features from all dialects except the target dialect $d_t$. The detailed procedure is as follows: first we create two subsets of the main dataset with 10% or 20% of fields (phonological features) missing, respectively. The missing fields are then augmented as previously described. Note that phonological features of dialect $d_t$ are not used for prediction of other phonological features. After the missing pronunciations are augmented, no records have empty fields.
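The random removal step can be sketched as masking a fixed fraction of table cells. This is an illustration under assumptions the text leaves open: the fraction is taken over the currently filled cells outside the target dialect’s columns, missing cells are represented by `None`, and the function name `remove_fields` is hypothetical.

```python
import random


def remove_fields(table, target_cols, fraction, seed=0):
    """Return a copy of the table with a fraction of non-target cells set to None."""
    rng = random.Random(seed)
    candidates = [
        (i, j)
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if j not in target_cols and value is not None
    ]
    n_remove = round(fraction * len(candidates))
    masked = [row[:] for row in table]
    for i, j in rng.sample(candidates, n_remove):
        masked[i][j] = None
    return masked


# Toy table: 10 records by 10 feature columns; column 0 belongs to the target dialect.
toy_table = [[f"v{i}{j}" for j in range(10)] for i in range(10)]
toy_masked = remove_fields(toy_table, target_cols={0}, fraction=0.1)
print(sum(value is None for row in toy_masked for value in row))
```

Keeping the target columns out of the candidate set mirrors the requirement that the target dialect’s features serve only as labels.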

To conduct the statistical significance $t$-test, we perform the following procedure 30 times. We randomly split the records 2:1 into training (67%) and test (33%) data. Since each record is associated with multiple labels, we employ multiclass SVMs to learn the labels independently. The features fed to the SVM classifiers are the binary rime book feature vectors ($\mathbf{b}_c$s) and the phonological features of all dialects except the target dialect $d_t$. The corresponding labels are the phonological features of dialect $d_t$, and the output of these classifiers is the predicted labels for those features.

TABLE III

PREDICTION ACCURACY WITH/WITHOUT DIALECTAL DATA

D. $t$-Test

We apply two-sample $t$-tests to examine whether one configuration is significantly better than another. Two-sample $t$-tests are applied since we assume the samples are independent. As the number of samples is large and the samples’ standard deviations are known, the following two-sample $t$-statistic is appropriate in this case:

$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n} + \dfrac{s_B^2}{n}}}$$

where $\bar{x}$ is mean accuracy, $s^2$ is variance of accuracy, and $n$ is sample number (in our experiments, $n = 30$). If the resulting $t$ score is equal to or less than 1.67 with 29 degrees of freedom and a statistical significance level of 95%, the null hypothesis is accepted; otherwise it is rejected.
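The test statistic can be sketched with the standard library. The sketch assumes an unpooled two-sample statistic computed from sample variances and compares it with the 1.67 threshold quoted in the text; the function name and the toy accuracy lists are hypothetical, not results from the paper.

```python
import math
from statistics import mean, variance


def two_sample_t(acc_a, acc_b):
    """Two-sample statistic from per-run accuracies of two configurations."""
    standard_error = math.sqrt(variance(acc_a) / len(acc_a) + variance(acc_b) / len(acc_b))
    return (mean(acc_a) - mean(acc_b)) / standard_error


# Toy accuracies for two configurations over three runs (the paper uses 30 runs).
config_a = [0.62, 0.64, 0.66]
config_b = [0.60, 0.61, 0.62]
t_score = two_sample_t(config_a, config_b)
print(t_score, t_score > 1.67)
```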

VI. EXPERIMENTS

We designed three experiments on character pronunciations of the Chaozhou dialect, a Min dialect spoken in eastern Guangdong, to evaluate the effect of the following factors:

A. Effect of Dialectal Data on Standard Classifiers

The conventional approach employed by philologists to Chinese dialect pronunciation prediction is to find correspondence between rime book categories and modern pronunciation, often through laborious human inspection. However, a clear correspondence between the two does not always exist. In the Wu dialect, for example, the rime book categories 夬 (guai) and 佳 (jia) are not clearly distinguished, sometimes being referred to as -ua and sometimes as -uo. Introducing dialectal data (other dialects’ phonological features) may help distinguish pronunciation in some dialects.

We train the SVM classifier to predict character pronunciations in Chaozhou. As previously described, we conducted two runs:

1) Rime Book Only (R): In this run only the rime book features, namely 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades), are included.

2) Rime Book + Full Dialectal Data (R+F): In addition to rime book features, all dialectal data are used. In cases where there are missing pronunciations, a random guess is supplied for each phonological feature for the SVM classifier.

TABLE IV

PREDICTION ACCURACY WITH DIFFERENT DIALECT GROUPS

The results are listed in Table III. It is obvious that by including dialectal data, we make a significant performance gain.

B. Impacts of Proximate Dialects

[26] reported that POS tagging performance can be improved by including more languages, especially closely related languages. We carried out experiments to see whether using rime book features (R) with closely related dialects (+C) is more effective than with distantly related dialects (+D).

We compared the OPFA of the Xi’an and Chaozhou dialects, which belong to the Mandarin and Min dialect groups, respectively. The Mandarin dialects we use in the experiments are Jinan, Taiyuan, and Beijing; for the Min dialects we use Xiamen, Fuzhou, and Jian’ou. For each dialect we conduct two runs, the first using dialects from the same dialect group, and the second using dialects from the other dialect group. To make the comparison meaningful, we keep the ratio of missing entries the same in every run by randomly removing entries. The missing entries are then filled with random guesses rather than a sophisticated augmentation method. Thus, each run has 10% of its pronunciations removed and replaced with random guesses. The average OPFA over 30 runs is listed in Table IV. The results show that R+C outperforms R+D for both the Xi’an and Chaozhou dialects by a statistically significant margin.

C. Effect of Data Augmentation

As described in Section I, the data for many Chinese dialects are scarce. Our data augmentation model is designed to fill in missing pronunciation information. If our augmentation model is effective, one application would be to use multiple resource-poor dialects to augment missing data in another dialect’s pronunciation database. For data augmentation, we use the procedure described in Section V-C to fill in the missing pronunciations in the Chaozhou dialect.

For comparison, we employ three different methods to augment the missing data as baselines:

1) Logistic Regression (-L): Using the rime book features $\mathbf{b}_c$, a discriminative model is trained to predict missing phonological values.

2) Naive Bayes (-N): Similar to the logistic regression model, a generative model is trained using $\mathbf{b}_c$ to predict missing phonological values.

3) Random (-R): The missing phonological values are guessed randomly.

In this experiment we test two different amounts of removal, 10% and 20%. All the following SVM classifiers use the RBF kernel, with parameters and . The number of SLRs in our augmentation model is set to 200.

TABLE V

EFFECTS OF DATA AUGMENTATION WITH CLOSELY (R+C) AND DISTANTLY (R+D) RELATED DIALECT DATA

TABLE VI

IPFA TABLE

The results and corresponding $t$ values are listed in Table V. Using our data augmentation model consistently improves OPFA. Interestingly, the margin of improvement seems to be greater with closely related dialect data than with distantly related dialect data on both the 10% and 20% datasets.

VII. ANALYSIS AND DISCUSSION

We are interested in how the choice of training dialects affects individual feature predictions. Table VI shows the percentage IPFA improvement over the baseline random augmentation. The R+C run benefits from the augmentation in all features except nasalization, the reason for which is unclear.

In the R+D run, tone, initial, and final features show worse IPFA after augmentation. This can be explained by considering the assumptions of our proposed model. We assume that the dialects exhibit correspondence among phonological features across dialects. That is, corresponding phonological features across dialects should be put under the same SLR. Therefore, if the dialects lack such correspondence, the augmented features may be inaccurate. It is evident that phonological features such as tones, initials, and finals do not have good correspondence across different dialect families [27]. Recent research suggests tones in Min dialects may be related to an innovation of the Wu-Min proto-dialect [27], which Mandarin did not share. As for initials, there is a striking difference between the “heavy” and “light” initial distinction in Mandarin and Min dialects [28]. Finals also lack good correspondence: the Min dialects have preserved most final stops from Middle Chinese, while Mandarin dialects have lost many. Thus, it is difficult to predict final consonants in Min dialects using Mandarin dialects and vice versa.

The IPFA metric seems to reflect the level of correspondence between the target dialect and other dialects, both closely and distantly related. The possibility of determining dialectal relationships between individual dialects by comparing respective IPFA improvement scores may lead to interesting discoveries.

VIII. CONCLUSION

We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover phonological patterns that exist in multiple dialects, which are referred to as superlingual rhymes (SLRs) in our proposed model. The proposed model can predict character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. We evaluate the prediction accuracy in terms of phonological features, such as tone and initial phoneme. We also evaluate all phonological features of a character as a whole, a measure we call overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves the OPFA of a support vector machine (SVM) model.

In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The experimental results show that using features from closely related dialects results in higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model’s OPFA by up to 7.6%. We also note that this improvement is greater when using closely related dialect data.

APPENDIX A

INTEGRATION OF , , AND

Since , , and have Dirichlet priors, the posterior distributions of , , and , which have the form

are Dirichlet-multinomial distributions, as introduced in [29]. We clarify how we integrate out these variables with as an example.

For convenience, (1) is repeated here:

By fixing and , terms involving and in (1) are

The latter terms can be refactored to , where is the number of observations with phonological feature value , SLR and dialect . Thus, it can be rewritten as

Using the identity of (3), we have

(7)

Variables and can be integrated out in the same fashion.
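As a rough numerical illustration of the kind of closed form obtained when a Dirichlet-distributed multinomial parameter is integrated out, the sketch below evaluates the standard Dirichlet-multinomial identity: the integral of prod_k theta_k^(n_k) under a Dirichlet(alpha) prior equals Gamma(sum alpha) / Gamma(sum alpha + sum n) times prod_k Gamma(alpha_k + n_k) / Gamma(alpha_k). The function name `log_dirichlet_multinomial` and the arguments `counts` and `alpha` are assumptions introduced for this example only; they are generic notation, not the symbols used in (1) or (7).

```python
from math import lgamma, exp

def log_dirichlet_multinomial(counts, alpha):
    """Log of the integral of prod_k theta_k**counts[k] under Dirichlet(alpha).

    counts: observed count for each outcome (hypothetical input).
    alpha:  Dirichlet hyperparameter for each outcome (hypothetical input).
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    # Ratio of the normalizing constants before and after adding the counts.
    total = lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
    for n_k, a_k in zip(counts, alpha):
        total += lgamma(a_k + n_k) - lgamma(a_k)
    return total

# Two outcomes with a uniform prior: the integral of theta^2 * (1 - theta)
# over [0, 1] is mathematically 1/12.
print(exp(log_dirichlet_multinomial([2, 1], [1.0, 1.0])))
```

Working in log space with `lgamma` avoids overflow of the Gamma function when the counts are large, which is the usual situation in a sampler that touches every observation.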

APPENDIX B

DERIVATION OF THE GIBBS SAMPLER

(8)

(9)

and again using the identity , we have

(9)

using the fact that , where is the number of observations with SLR , and the fact that a character has a fixed number of observations, we can further simplify (9) into

(10)

where if the current assignment of ; and , otherwise.
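To make the role of the indicator term concrete, the following is a minimal sketch of a collapsed Gibbs update for a deliberately simplified setting: each item carries a single categorical observation and is assigned to one latent cluster. It is not a reproduction of (10). The function names `gibbs_conditional` and `resample`, the count structures, and the symmetric hyperparameters `alpha` and `beta` are all assumptions made for illustration. The indicator `own` subtracts the item's own contribution from the counts when the candidate cluster is the item's current assignment.

```python
import random

def gibbs_conditional(value, current, cluster_counts, value_counts,
                      alpha, beta, n_values):
    """Return normalized P(z = k | all other assignments) for one item.

    value:          the item's observed categorical value
    current:        index of the cluster the item is assigned to now
    cluster_counts: cluster_counts[k] = number of items in cluster k
    value_counts:   value_counts[k][v] = items in cluster k with value v
    """
    weights = []
    for k in range(len(cluster_counts)):
        # Indicator: 1 if k is the item's current cluster, 0 otherwise,
        # so the item itself is excluded from the counts.
        own = 1 if k == current else 0
        n_k = cluster_counts[k] - own
        n_kv = value_counts[k].get(value, 0) - own
        weights.append((n_k + alpha) * (n_kv + beta) / (n_k + n_values * beta))
    total = sum(weights)
    return [w / total for w in weights]

def resample(value, current, cluster_counts, value_counts,
             alpha, beta, n_values, rng=random):
    """Draw a new cluster for the item and update the counts in place."""
    probs = gibbs_conditional(value, current, cluster_counts, value_counts,
                              alpha, beta, n_values)
    new = rng.choices(range(len(probs)), weights=probs)[0]
    cluster_counts[current] -= 1
    value_counts[current][value] -= 1
    cluster_counts[new] += 1
    value_counts[new][value] = value_counts[new].get(value, 0) + 1
    return new
```

A full sampler would sweep over all items repeatedly, calling `resample` for each one. When an item has several observations, as a character does in the model described here, the single-observation ratio above would have to be replaced by a product over observations.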

ACKNOWLEDGMENT

The authors would like to thank Prof. C.-C. Cheng for providing them the DOC dataset and the TASLP reviewers for their valuable comments, which helped them improve the quality of the paper.

REFERENCES

[1] “CMUDICT, CMU Pronouncing Dictionary,” 1998 [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

[2] J. H. Jenkins and R. Cook, “Unicode Han Database,” Tech. Rep. The Unicode Consortium, 2009.

[3] L.-Q. Tong, “Survey on the usage of Chinese languages and script,” (in Chinese) Language and Literature Press, Beijing, China, 2006 [Online]. Available: http://www.china-language.gov.cn/LSF/LSFrame.aspx

[4] C. Tang and V. J. van Heuven, “Mutual intelligibility of Chinese dialects experimentally tested,” Lingua, vol. 119, no. 5, pp. 709–732, 2009.

[5] J. B. Jensen, “On the mutual intelligibility of Spanish and Portuguese,” Hispania, vol. 72, no. 4, pp. 848–852, 1989.

[6] E. G. Pulleyblank, “Qieyun and Yunjing: The essential foundation for Chinese historical linguistics,” J. Amer. Oriental Soc., vol. 118, no. 2, pp. 200–216, 1998.

[7] M. Streeter, “DOC, 1971: A Chinese dialect dictionary on computer,” Comput. Humanities, vol. 6, no. 5, pp. 259–270, 1972.

[8] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Mach. Learn., vol. 39, no. 2-3, pp. 103–134, 2000.

[9] X. Lu, B. Zheng, A. Velivelli, and C. Zhai, “Enhancing text categorization with semantic-enriched representation and training data augmentation,” J. Amer. Med. Inform. Assoc., vol. 13, no. 5, pp. 526–535, 2006.

[10] D. van Dyk and X. Meng, “The art of data augmentation,” J. Comput. Graph. Statist., vol. 10, no. 1, pp. 1–50, 2001.

[11] A. Bouchard-Côté, P. Liang, T. Griffiths, and D. Klein, “A probabilistic approach to diachronic phonology,” in Proc. Empirical Methods in Natural Lang. Process. Comput. Natural Lang. Learn. (EMNLP/CoNLL), 2007.

[12] M. Ben Hamed and F. Wang, “Stuck in the forest: Trees, networks and Chinese dialects,” Diachronica, vol. 23, no. 1, pp. 29–60, 2006.

[13] W. S. Allen, Vox Latina: A Guide to the Pronunciation of Classical Latin (in Eng.). Cambridge, U.K.: Cambridge Univ. Press, 1978.

[14] P.-H. Ting, “Some thoughts on the reconstruction of Middle Chinese,” J. Chinese Linguist., vol. 249, no. 6, p. 414, 1995.

[15] T.-L. Mei, “The survival of two pairs of Qieyun distinctions in Southern Wu dialects,” J. Chinese Linguist., vol. 280, no. 1, pp. 1–15, 2001.

[16] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, “Adding more languages improves unsupervised multilingual part-of-speech tagging: A Bayesian non-parametric approach,” in Proc. NAACL ’09: Human Lang. Technol.: 2009 Annu. Conf. North Amer. Chapt. Assoc. Comput. Linguist., Morristown, NJ, 2009, pp. 83–91.

[17] S. Stüker, F. Metze, T. Schultz, and A. Waibel, “Integrating multilingual articulatory features into speech recognition,” in Proc. 8th Eur. Conf. Speech Commun. Technol., 2003, Citeseer.

[18] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.

[19] S. Goldwater and T. Griffiths, “A fully Bayesian approach to unsupervised part-of-speech tagging,” in Proc. 45th Annu. Meeting Assoc. Comput. Linguist., Prague, Czech Republic, Jun. 2007, pp. 744–751.

[20] G. Heinrich, “Parameter Estimation for Text Analysis,” Tech. Rep. Univ. of Leipzig, Leipzig, Germany, 2008 [Online]. Available: http://www.arbylon.net/publications/text-est.pdf

[21] P. Resnik and E. Hardisty, “Gibbs sampling for the uninitiated,” Univ. of Maryland, 2010, Tech. Rep. CS-TR-4956, UMIACS-TR-2010-04, LAMP-153.

[22] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, “Painless unsupervised learning with features,” in Proc. Human Lang. Technol.: 2010 Annu. Conf. North Amer. Chap. Assoc. Comput. Linguist., Los Angeles, CA, Jun. 2010, pp. 582–590.

[23] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Math. Program., vol. 45, no. 3, pp. 503–528, 1989.

[24] A. Dempster et al., “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. Ser. B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.

[25] C.-C. Cheng, “Measuring relationship among dialects: DOC and related resources,” Comput. Linguist., vol. 2, no. 1, pp. 41–72, 1997.

[26] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, “Unsupervised multilingual learning for POS tagging,” in Proc. EMNLP ’08: Proc. Conf. Empirical Methods Natural Lang. Process., Morristown, NJ, 2008, pp. 1041–1050.

[27] R.-W. Wu, “A comparative study on the phonologies of Min and Wu dialects,” Ph.D. dissertation, Dept. of Chinese Literature, National Chengchi Univ., Taipei, Taiwan, 2005.

[28] U.-J. Ang, “On the motivation and typology of aspiration and nasalization in Sinitic languages,” in Proc. 6th Int. and 17th National Conf. Chinese Phonol., Taipei, Taiwan, May 1999.

[29] T. Minka, “Estimating a Dirichlet distribution,” Mass. Inst. of Technol., Cambridge, MA, Tech. Rep., 2000.

Chu-Cheng Lin received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, in 2008 and 2010, respectively.

His current research interests are information retrieval, natural language processing, and computational phonology.

Richard Tzong-Han Tsai received the B.S., M.S., and Ph.D. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1997, 1999, and 2006, respectively.

He was a Postdoctoral Fellow at Academia Sinica from 2006 to 2007. He is now an Assistant Professor in the Department of Computer Science and Engineering, Yuan Ze University, Zhongli, Taiwan. His research areas are natural language processing, cross-language information retrieval, biomedical literature mining, and information services on mobile devices.