
Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University

Doctoral Dissertation

資訊保存與自然語言處理的應用

Information Preservation and Its Applications to Natural Language Processing

陳瑞呈

Ruey-Cheng Chen

Advisor: Jieh Hsiang, Ph.D. (項潔)

January 2013


Acknowledgments

I wish to thank my advisor Jieh Hsiang for his help and support over the years, and for being my favorite academic role model.

I have been fortunate to work with and learn from many brilliant people, including Chun-Yi Chi, Wei-Yen Day, Andrew S. Gordon, Shao-Hang Kao, Chia-Jung Lee, Shuo-Peng Liao, Yi-Chun Lin, Wen-Hsiang Lu, Reid Swanson, Chiung-Min Tsai, and Jenq-Haur Wang. I owe my deepest gratitude to Hsin-Hsi Chen, Pu-Jen Cheng, Lee-Feng Chien, Jane Yung-Jen Hsu, and Chih-Jen Lin, for their great service on my dissertation and proposal committees.

The credit is shared with my family and in-laws. I am especially indebted to my wife, Hsing-Hui, and my son, Karl. Without their patience and insistence, this dissertation would not have been possible.


Abstract

In this dissertation, we motivate a mathematical concept, called information preservation, in the context of probabilistic modeling. Our approach provides a common ground for relating various optimization principles, such as maximum and minimum entropy methods. In this framework, we make explicit an assumption that model induction is a directed process toward some reference hypothesis. To verify this theory, we conducted extensive empirical studies on unsupervised word segmentation and static index pruning. In unsupervised word segmentation, our approach significantly boosts the segmentation accuracy of an ordinary compression-based method and achieves performance comparable to several state-of-the-art methods in terms of efficiency and effectiveness. For static index pruning, the proposed information-based measure achieves state-of-the-art performance, and it does so more efficiently than the other methods. Our approach to model induction has also led to new discoveries, such as a new regularization method for cluster analysis. We expect that this deepened understanding of the induction principles may produce new methodologies for probabilistic modeling, and eventually lead to breakthroughs in natural language processing.

Keywords: information theory; information preservation; induction principle; unsupervised word segmentation; static index pruning; entropy optimization.


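Abstract (Chinese)

In this dissertation, we derive, within the scope of probabilistic modeling, a mathematical concept called "information preservation." Our approach provides a basis for connecting several optimization principles, such as the maximum and minimum entropy methods. In this framework, we explicitly assume that model induction is a directed process aimed at some reference hypothesis. To test this theory, we conducted detailed empirical studies on unsupervised word segmentation and static index pruning. In unsupervised word segmentation, our approach significantly improves the segmentation accuracy of a compression-based method, and comes close to the current best methods in both effectiveness and efficiency. In static index pruning, the proposed information-based measure achieves the current best results more efficiently than the other methods. Our approach to model induction has also yielded new findings, such as a new regularization method in cluster analysis. We expect that this deeper understanding of induction principles can produce new methodologies for probabilistic modeling and eventually lead to breakthroughs in natural language processing.

Keywords: information theory; information preservation; induction principle; unsupervised word segmentation; static index pruning; entropy optimization.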


List of Figures

2.1 A practical example of linear regression.

2.2 Cluster analysis is a way to generalize data into groups.

3.1 Performance evaluation for the proposed method on the CityU training set.

3.2 Performance evaluation for the proposed method on the MSR training set.

4.1 Performance results for all the methods on WT2G. Rows indicate different performance measures (MAP/P@10). Columns indicate different query types (short/long).


List of Tables

3.1 The notation used in the development of regularized compression.

3.2 Performance evaluation for the proposed method across different test corpora. The first row indicates a reference HDP run (Goldwater et al., 2009); the other rows represent the proposed method tested on different test corpora. Columns indicate performance metrics, which correspond to precision, recall, and F-measure at word (P/R/F), boundary (BP/BR/BF), and type (TP/TR/TF) levels.

3.3 Performance evaluation on the Bernstein-Ratner corpus. The reported values for each method indicate word precision, recall, F-measure and running time, respectively. The underlined value in each column indicates the top performer under the corresponding metric.

3.4 A short summary about the subsets in the Bakeoff-2005 dataset. The size of each subset is given in number of words (W) and number of unique word types (T).

3.5 Performance evaluation on the common training subsets in the Bakeoff-2005 and Bakeoff-2006 datasets. The reported values are token F-measure. The boldface value in each column indicates the top performer for the corresponding set.

3.6 Performance evaluation on two random samples from the common sets (CityU and MSR subsets) in the Bakeoff-2005 and Bakeoff-2006 datasets. Note that the third run is an outside test.

4.1 The overall performance results on LATimes. We round down all the reported measures to the 3rd digit under the decimal point, and ignore preceding zeroes and decimal points for brevity. Underlined entries indicate the best performance in the corresponding group. Entries that are significantly superior or inferior (p < 0.05) to TCP are denoted by superscripts N or H, respectively. Analogously, entries that are significantly superior or inferior to PRP are denoted by subscripts ♯ or ♭, respectively.

4.2 The overall performance results on TREC-8. See Table 4.1 for the description of notation.

4.3 The overall performance results on WT2G. See Table 4.1 for the description of notation.

4.4 Correlation analysis for the decision measures on the LATimes corpus. The correlation is estimated using Pearson’s product-moment correlation coefficient, weighted using term frequencies of index entries.

5.1 A brief summary of problems that are solvable by using information preservation. Each row indicates one research problem. The second column indicates the reference hypothesis used in the corresponding information preservation framework. The third column shows the relative degree of entropy (low/high) for each problem. Related principles are listed in the last column; abbreviations are used instead of full names. The principles include minimum entropy (MinEnt), maximum entropy (MaxEnt), minimum error (MinErr), and maximum likelihood (MaxLL).


Chapter 1 Introduction

1.1 Goals and Methodology

The main theme of this thesis work is a mathematical concept called information preservation. This concept is introduced in the context of probabilistic modeling, as an optimization principle for fitting probabilistic models. It has a very broad application domain and has been shown to be useful in practical settings. Our primary goal here, if not the only one, is to convince readers that this theory is interesting and useful.

This is never an easy task, and sharing crowded ground with renowned mathematical ideas, such as maximum entropy and maximum likelihood, makes it even more challenging. Specifically, the intricacy embodied in information preservation adds much complexity to this writing. As its name suggests, the principle is about preserving the information carried by fitted probabilistic models, measured by the absolute change in entropy. This concept, although straightforward, is not easy to introduce without careful postulation.

To make the underlying concept easy to comprehend, we incorporate many practical examples alongside the formal arguments. This application-oriented strategy is applied throughout the work. First, three classic problems in probabilistic modeling are treated in the first part of the thesis, showing that information preservation leads to a unified, simple way of looking at model fitting problems. Then, two natural language processing applications are introduced as proofs of concept. We show that information preservation can be used to solve complicated cases where other mathematical principles do not seem to fit, such as inducing a vocabulary model in unsupervised word segmentation, or reducing a predictive model in static index pruning. Finally, we conclude this work by discussing some possible extensions of information preservation.

1.2 Contribution

The concept of information preservation provides a common ground for maximum and minimum entropy methods. In this unified framework, we highlight that the usual principles applied in model fitting, such as maximization and minimization of entropy, are not arbitrary decisions. An implicit reference hypothesis is there to guide the optimization process, and model fitting seeks to approach that hypothesis in terms of model complexity in a constrained space. The existence of such a reference model is a strong indication of how the search for a fitted model should be conducted.

In this respect, our account partly reconciles the apparent incompatibility between entropy maximization and minimization. From the view of information preservation, these principles no longer contradict each other; they are just different use cases that find application in different scenarios.

Our study also covers two natural language processing applications: unsupervised word segmentation and static index pruning. Our view of unsupervised word segmentation is a renovation of the conventional compression-based methodology. We introduce the idea of preserving vocabulary complexity into an ordinary compression-based framework. The renovated approach, regularized compression, significantly boosts segmentation accuracy and achieves performance comparable to several state-of-the-art methods. Our approach can be easily retargeted to larger text corpora since it is more efficient and less resource-demanding. Our experimental study also reveals one interesting application of this method, namely applying parallel segmentation to one large text collection. This application is not addressed in this thesis work, but will be a primary focus in our further exploration.

We have also obtained fruitful results in our experiments in static index pruning. Conventional methods usually solve this problem using impact-based decision criteria, based on the idea that impact is directly related to retrieval performance. Our approach provides an alternative view of this problem. We show that information-based measures can achieve competitive performance in a more efficient way. From the experiments, we uncover the possibility of combining multiple decision criteria to prioritize the index entries, which is backed by the low correlation found between the proposed approach and the other state-of-the-art methods. Although further exploration of this idea is beyond the scope of this thesis, our result has led to deeper insights into this research problem and has thus paved the way for future efforts.

1.3 Outline

Chapter 2 covers the definition and implications of information preservation. We first motivate this concept from a theoretical aspect, showing that it generalizes the ideas advocated by well-known mathematical principles such as maximum-entropy and minimum-entropy methods. Then, we apply information preservation to some well-studied fundamental problems in probabilistic modeling, including feature selection, regression, and cluster analysis. Through these working examples, we show that our approach leads to simple, intuitive solutions in accordance with the known results.

In Chapter 3, we give an overview of the development of unsupervised word segmentation. This is the first application example of information preservation studied in this thesis work. Our approach to unsupervised word segmentation is motivated from a different angle, based on the idea of text compression. We first assume that a solution to unsupervised word segmentation is equivalent to a series of choices made in text compression. These choices can be determined by solving a general optimization problem that corresponds to information preservation. This problem is intractable due to the various combinatorial choices involved, so we present a simplified iterative solution as an approximation. Later, we describe a full experimental setup that covers experiments on two standard benchmarks. The experimental result shows that the performance of our approach is comparable to that of state-of-the-art unsupervised word segmentation methods in terms of F-measure.

We introduce the problem of static index pruning in Chapter 4 as the second application of information preservation. We first briefly go over the past effort in this area. Then we proceed to formulate this problem in the information preservation framework, in which static index pruning is seen as a process of degenerating a predictive model. Under this rationale, the primary goal in static index pruning is to preserve the predictive power of the pruned index. This idea is written out as an information preservation problem, and is solved by using an approximation algorithm. This construction leads to an efficient formula that we can use to rank postings. We test our approach on three standard collections of different sizes, and compare its performance with two other state-of-the-art methods. The result shows that our approach is competitive with the best performance across all experimental settings.

Some incomplete results are loosely discussed in Chapter 5. We briefly review the problem types that are solvable by information preservation and discuss the advantages of our approach. Finally, in Section 3.8, we summarize the contributions and possible implications of this work to conclude this thesis.


Chapter 2 Information Preservation

2.1 Overview

Much of the recent development in natural language processing is devoted to probabilistic modeling. Nowadays, probabilistic methods have prevailed in almost every subfield of natural language processing. Notable examples include probabilistic context-free grammar (PCFG) in parsing, max-margin methods in classification, and mixture models in clustering, to name a few.

What lies at the core of a probabilistic method is often an assumption, written in probabilistic statements, about a certain aspect of language that one tries to model. The probabilistic statements discussed here can be generative, describing a generative process that produces the language we have observed, or discriminative, describing a decision model that helps differentiate one type of observation from another.

Probabilistic methods are not merely about assumptions. One needs to find the right model, e.g., parameters, to fit these assumptions, and this is when optimization techniques come into play. Behind any optimization problem, there is always some optimization principle, based on which one specifies the objective in mathematical terms. Such a principle usually has deep roots in statistics and physics, and sometimes it can be very philosophical.

Most of the well-known principles carry very simple messages that generalize essential aspects of scientific modeling, which in turn leads to successful applications. Maximum likelihood (ML), for one, is used in almost every scientific area. There are many others, including maximum a posteriori (MAP), a Bayesian remake of maximum likelihood; maximum entropy (ME), asserting that one shall seek the most general model; and minimum description length (MDL), advocating the famous philosophical concept of Occam’s razor.

In this chapter, we present the concept of information preservation that we believe encapsulates some of the common wisdom in probabilistic modeling. As we will show later, information preservation covers two well-known principles, maximum entropy and minimum entropy, as its special cases. It also provides a ground for discussions about how information-based optimization can be designed in probabilistic modeling.

2.2 Formal Definition

We define an information preservation problem as follows. Let Θ be the hypothesis space, and Θ_C ⊂ Θ be a subset of hypotheses that satisfy some constraint C. Let X be some random variable and H be some entropy-like information quantity that is well defined for every θ ∈ Θ.

Given that θ₀ ∈ Θ, the following minimization problem is said to be an information preservation problem:

\[
\begin{aligned}
\text{minimize} \quad & |H_\theta(X) - H_{\theta_0}(X)| \\
\text{subject to} \quad & \theta \in \Theta_C.
\end{aligned}
\tag{2.1}
\]

The general idea is very simple. Given the search space Θ_C and a reference hypothesis θ₀, the best hypothesis is the one that minimizes the absolute change in information, measured by some quantity H. In other words, we want to find an approximation of the reference hypothesis θ₀ in the constrained search space Θ_C; the closeness (or distance) between two hypotheses is measured in terms of the change in entropy.
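Problem (2.1) can be illustrated with a minimal Python sketch over a finite hypothesis set. The Bernoulli family, the grid of candidate parameters, the reference value 0.75, and the function names `bernoulli_entropy` and `preserve_information` are all assumptions introduced for illustration only; they are not part of the formal development.

```python
import math


def bernoulli_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def preserve_information(candidates, entropy, h_ref):
    """Return the candidate whose entropy is closest to the reference entropy.

    This is a brute-force search for Equation (2.1) over a finite
    constrained set; `candidates` plays the role of Theta_C.
    """
    return min(candidates, key=lambda theta: abs(entropy(theta) - h_ref))


# Hypothetical constrained set: a coarse grid of Bernoulli parameters.
theta_c = [0.6, 0.7, 0.8, 0.9]
# Hypothetical reference hypothesis theta_0 = 0.75, which is not on the grid.
h_ref = bernoulli_entropy(0.75)
best = preserve_information(theta_c, bernoulli_entropy, h_ref)
print(best, abs(bernoulli_entropy(best) - h_ref))
```

Note that closeness is judged by entropy, not by distance in parameter space: 0.7 and 0.8 are equally far from 0.75 as parameters, but their entropies differ from the reference by different amounts.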

Information preservation comprises three key aspects, information, hypothesis, and approximation, which are detailed in the following paragraphs.

Information. The information of a probabilistic model is measured by the quantity H. The definition of H can be further generalized as long as the extension is measurable for every hypothesis θ ∈ Θ. This function can be entropy H(X), joint entropy H(X, Y, Z), or even conditional entropy H(X|Y). To write a problem in the form of information preservation, we need to decide which quantity best represents the information.

Hypothesis. In an information preservation problem, we search for a hypothesis that is closest to the reference hypothesis θ₀ in terms of the information estimate H. Since this is a strong assumption that biases the search process, the choice of reference hypothesis needs to be justified. A good starting point is to consider some trivial hypothesis that can be easily formed based on the observations, as in the examples later in this chapter.

Approximation. The optimization problem may not be analytically feasible to solve. As we will see later, information preservation can sometimes fall back to entropy maximization or minimization, neither of which is easy to solve when combinatorial choices are involved. Therefore, it is essential to know whether the formulation can be solved approximately using efficient numerical methods.

In the next section, we discuss the relation of information preservation to the other principles.


2.3 Relation to Other Approaches

2.3.1 Principle of Maximum Entropy

Jaynes (1957a; 1957b) proposed the principle of maximum entropy, stating that a distribution that maximizes entropy is the one making the minimal claim beyond the knowledge embedded in the prior data. Therefore, the maximum-entropy choice in model fitting is equivalent to choosing the least informative distribution from an entire family of distributions.

This principle is usually cast as a constrained entropy maximization problem, which in simple continuous cases is solved by convex programming. Consider that ΘC ⊂ Θ is the set of feasible parameter combinations. The entropy maximization problem is written as follows.

\[
\begin{aligned}
\text{maximize} \quad & H_\theta(X) \\
\text{subject to} \quad & \theta \in \Theta_C.
\end{aligned}
\tag{2.2}
\]

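When we take θ₀ as the most uninformative model in Θ, the information preservation problem reduces to the maximum entropy problem. Since H_θ₀(X) ≥ H_θ(X) for every θ ∈ Θ, the objective |H_θ(X) − H_θ₀(X)| equals H_θ₀(X) − H_θ(X), which is minimized exactly when H_θ(X) is maximized.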

2.3.2 Principle of Minimum Cross-Entropy

Kullback (1959), who also developed the Kullback-Leibler divergence, proposed the principle of minimum cross-entropy, which is sometimes called the principle of minimum discrimination information. It asserts that, as more data is observed, a new distribution shall be fitted and this choice needs to be made as close to the original distribution as possible, measured in terms of Kullback-Leibler divergence.

Let the original distribution be denoted as p and the new one as q. The Kullback-Leibler divergence between p and q (in that order) can be expressed in terms of the entropy of p and the cross entropy of p and q. In the following derivation, we consider only the discrete case. Note that we use D(p||q) to denote the Kullback-Leibler divergence and CE(p; q) to denote the cross entropy.

\[
\begin{aligned}
D(p\|q) &= \sum_x p(x) \log \frac{p(x)}{q(x)} \\
&= \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) \\
&= -H(p) + CE(p; q).
\end{aligned}
\tag{2.3}
\]

Since p is a known distribution, minimizing D(p||q) is equivalent to minimizing the cross entropy CE(p; q). This objective is convex in q, since it is a convex combination, with weights p(x), of the terms − log q(x), each of which is convex. Again, this is a convex programming problem. The principle can be written as the following optimization problem in our notation:

\[
\begin{aligned}
\text{minimize} \quad & CE(\theta_0; \theta) \\
\text{subject to} \quad & \theta \in \Theta_C.
\end{aligned}
\tag{2.4}
\]
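The decomposition in Equation (2.3), and the resulting equivalence between minimizing the divergence and minimizing the cross entropy, can be checked numerically. The sketch below uses hypothetical three-outcome distributions and a small hypothetical candidate list standing in for Θ_C; the function names are ours.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross entropy CE(p; q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical original distribution p and candidate distributions q.
p = [0.5, 0.3, 0.2]
candidates = [[0.4, 0.4, 0.2], [0.6, 0.2, 0.2], [1 / 3, 1 / 3, 1 / 3]]

# Equation (2.3): D(p||q) = -H(p) + CE(p; q).
lhs = kl_divergence(p, candidates[0])
rhs = -entropy(p) + cross_entropy(p, candidates[0])
print(lhs, rhs)

# Because H(p) is fixed, both objectives select the same candidate.
best_kl = min(candidates, key=lambda q: kl_divergence(p, q))
best_ce = min(candidates, key=lambda q: cross_entropy(p, q))
print(best_kl, best_ce)
```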

There is no direct connection between information preservation and the principle of minimum cross-entropy. Nevertheless, they agree on one essential point: closeness between the old and the new distributions shall be highly valued in the search for a better fit to the new facts.

2.3.3 Minimum-Entropy Methods

Minimum entropy is not formally recognized as a mathematical principle, but it has already found many applications in combinatorics and machine learning. The general philosophy is to find the most informative model that satisfies the constraint, and is sometimes seen as incompatible with the principle of maximum entropy, which states exactly the opposite.

While the compatibility issue between the two theories is beyond the scope of our work, we do find a few examples that show how minimum entropy leads to other principles, such as error minimization.

When the reference hypothesis θ₀ is set at some extreme point, e.g., one with H_θ₀(X) = 0, information preservation reduces to minimum entropy. Nevertheless, optimizing toward such extremes may not always be well justified by itself. In the next section, we review some typical assumptions for setting such reference hypotheses.
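The reductions to maximum and minimum entropy described in Sections 2.3.1 and 2.3.3 can be checked on a toy example. In the sketch below, the three candidate distributions (all satisfying a hypothetical constraint that the first outcome has probability 0.5) and the helper `closest_in_entropy` are assumptions made for illustration.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def closest_in_entropy(candidates, h_ref):
    """Brute-force solution of Equation (2.1) over a finite candidate list."""
    return min(candidates, key=lambda d: abs(entropy(d) - h_ref))


# Hypothetical constrained set: the first outcome has probability 0.5.
theta_c = [[0.5, 0.0, 0.5], [0.5, 0.1, 0.4], [0.5, 0.2, 0.3]]

# Reference 1: the uniform distribution over three outcomes, the least
# informative model, whose entropy is log 3.
max_ent_choice = closest_in_entropy(theta_c, math.log(3))
# Reference 2: a degenerate distribution, whose entropy is 0.
min_ent_choice = closest_in_entropy(theta_c, 0.0)

# Compare against direct entropy maximization and minimization.
print(max_ent_choice == max(theta_c, key=entropy))
print(min_ent_choice == min(theta_c, key=entropy))
```

The only thing that changes between the two runs is the reference entropy; the search procedure itself is identical.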

2.4 Examples

2.4.1 Feature Selection

Feature selection is a technique commonly used in supervised learning tasks, such as text classification, where the feature dimension is usually very high. In this case, feature selection determines a subset of feature dimensions so that the learning task can be carried out more efficiently.

Let O be a set of labeled instances ⟨(x₁, c₁), (x₂, c₂), . . . , (xₙ, cₙ)⟩ where each x_i ∈ {0, 1}^d is a d-dimensional feature vector and each c_i ∈ {1, . . . , K} is a class label. Here, for simplicity, we assume binary features and finite class labels.

Every subset of these features forms a unique partition over the instances. A subset of c binary features generates 2^c possible value combinations, according to which one can sort the instances uniquely into different groups. That is to say, any two instances with the same combination of these features are assigned to the same group.

Assume that we attach a new variable a_i ∈ ℕ (whose value can sometimes be greater than K) to every instance (x_i, c_i). The variables {a₁, a₂, . . . , aₙ} collectively represent some partition over the instances, determined by some selected set of features.

Consider the entropy of class labels, H(C), and the conditional entropy of class labels given the partition assignment, H(C|A). Here, C and A are random variables for class label and partition assignment, respectively. These two quantities have the following relation:

H(C) = H(C|A) + I(C; A), (2.5)

where I(C; A) is the mutual information between class labels and the partition assignment.

Now, let θ₀ denote an ideal partition over the instances, with which we assign a_i = c_i for each instance. This partition always exists because we have knowledge about the true class labels. Given this ideal partition, the conditional entropy H(C|A) becomes 0, and therefore, for any partition θ over the instances, we know that H_θ₀(C|A) is always less than or equal to H_θ(C|A).

We use θ₀ as the reference partition and write down the information preservation problem. Let θ^∗ be the optimal partition. We have the following equations:

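\[
\begin{aligned}
\theta^* &= \arg\min_\theta \; |H_\theta(C|A) - H_{\theta_0}(C|A)| \\
&= \arg\min_\theta \; H_\theta(C|A) \\
&= \arg\min_\theta \; H_\theta(C) - I_\theta(C; A) \\
&= \arg\max_\theta \; I_\theta(C; A).
\end{aligned}
\tag{2.6}
\]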

This result accords with a commonly used feature selection method that chooses the optimal feature set by maximizing the mutual information.
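The derivation in Equation (2.6) can be sketched as a brute-force search over feature subsets. The toy data set and the function names `conditional_entropy` and `select_features` below are hypothetical, and exhaustive enumeration is used only because the example is tiny; practical feature selection would need a more scalable search.

```python
import math
from collections import Counter
from itertools import combinations


def conditional_entropy(labels, groups):
    """Empirical H(C|A) in bits for class labels C and partition ids A."""
    n = len(labels)
    joint = Counter(zip(groups, labels))
    marginal = Counter(groups)
    return -sum(
        (count / n) * math.log2(count / marginal[a])
        for (a, _), count in joint.items()
    )


def select_features(instances, labels, size):
    """Pick the feature subset whose partition minimizes H(C|A).

    Since the reference partition has H(C|A) = 0, this search solves the
    information preservation problem of Equation (2.6) by enumeration.
    """
    d = len(instances[0])

    def score(subset):
        # Instances sharing the same values on `subset` fall in one group.
        groups = [tuple(x[j] for j in subset) for x in instances]
        return conditional_entropy(labels, groups)

    return min(combinations(range(d), size), key=score)


# Hypothetical toy data: feature 0 is noise, feature 1 determines the label,
# and feature 2 agrees with the label on three of the four instances.
X = [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 0)]
y = [0, 0, 1, 1]
print(select_features(X, y, 1))
```

Because H(C) is the same for every partition, minimizing H(C|A) here selects the same subset as maximizing the mutual information I(C; A).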

2.4.2 Regression

Consider a series of data points O = ⟨(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)⟩ where each (x_i, y_i) ∈ R². We assume a functional relation y = f(x) between the two components and want to find the best fit f in a family of functions F. This is a classic problem that regression analysis seeks to solve.


Figure 2.1: A practical example of linear regression.

Let us first write this problem in the typical way. We use random variables X and Y to denote some random data point that we observe. The structural dependence between X and Y is defined as follows. Note that, in this definition, ε is a random variable that denotes the error:

\[ Y = f(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2). \]

In regression analysis, the goodness of fit is assessed in some way according to the error made by the structural assumption. For this reason, we need to specify an error distribution. Here, we make the usual assumption that ε follows a Gaussian distribution with zero mean and variance σ².

We estimate the entropy of ε empirically using the data points that we have observed, O, as samples. Specifically, this sample-based estimate is defined as:

\[ \tilde{H}(\epsilon) = -\frac{1}{n} \sum_{i=1}^{n} \log p(\epsilon_i), \tag{2.7} \]

where ε_i = y_i − f(x_i) denotes the error term for the i-th data point. Plugging the Gaussian density into Equation (2.7), we arrive at the following equation:

\[ \tilde{H}(\epsilon) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\epsilon_i^2}{2\sigma^2} + \frac{1}{2} \log 2\sigma^2\pi \right). \tag{2.8} \]

It can be easily shown that H̃(ε) is uniquely minimized when ε_i = 0 for all i = 1, . . . , n. In other words, when the search space F is unrestricted, any function that trivially maps the observed values {x_i} to their counterparts {y_i} minimizes the error entropy. To see why this is the case, we check the first and the second derivatives of H̃(ε) with respect to f(x_i):

\[ \frac{\partial}{\partial f(x_i)} \tilde{H}(\epsilon) = -\frac{1}{n\sigma^2} \epsilon_i, \tag{2.9} \]

\[ \frac{\partial^2}{\partial f(x_i)^2} \tilde{H}(\epsilon) = \frac{1}{n\sigma^2}. \tag{2.10} \]

The first derivative vanishes when ε_i = 0 for all i = 1, . . . , n. Since the second derivative is always positive, the stationary point is a global minimum.

Now we are ready to apply information preservation to this problem. We choose some trivial function f₀ as the reference model, i.e., for all i = 1, . . . , n, f₀(x_i) = y_i. Applying the concept of information preservation, the best model f* ∈ F is the one that minimizes the absolute difference in entropy against the reference model, as in:

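\[
\begin{aligned}
f^* &= \arg\min_{f \in F} \; |\tilde{H}_f(\epsilon) - \tilde{H}_{f_0}(\epsilon)| \\
&= \arg\min_{f \in F} \; \tilde{H}_f(\epsilon) - \tilde{H}_{f_0}(\epsilon) \\
&= \arg\min_{f \in F} \; \tilde{H}_f(\epsilon) \\
&= \arg\min_{f \in F} \; \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\epsilon_i^2}{2\sigma^2} + \frac{1}{2} \log 2\sigma^2\pi \right) \\
&= \arg\min_{f \in F} \; \sum_{i=1}^{n} \epsilon_i^2.
\end{aligned}
\tag{2.11}
\]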


This result is in line with the least squares method that minimizes the sum of squared errors of a fit.
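The agreement between Equation (2.11) and least squares can be illustrated with a short sketch. It assumes that F is the family of straight lines y = a·x + b, that σ = 1, and that the data points are made up; the function names `error_entropy` and `fit_line` are ours.

```python
import math


def error_entropy(errors, sigma=1.0):
    """Sample-based entropy estimate of Equation (2.8) for Gaussian errors."""
    n = len(errors)
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(e ** 2 / (2 * sigma ** 2) + const for e in errors) / n


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


# Hypothetical data points, roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b = fit_line(xs, ys)


def entropy_of(slope, intercept):
    """Error entropy of the line with the given slope and intercept."""
    return error_entropy([y - (slope * x + intercept) for x, y in zip(xs, ys)])


# Perturbing the least-squares line never lowers the error entropy, so the
# least-squares line is also the closest one to the zero-error reference.
base = entropy_of(a, b)
perturbed = [entropy_of(a + da, b + db)
             for da in (-0.1, 0.0, 0.1) for db in (-0.1, 0.0, 0.1)]
print(a, b, base, min(perturbed))
```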

2.4.3 Cluster Analysis

Consider that we have observed n data points ⟨x₁, x₂, . . . , xₙ⟩ in some d-dimensional space. We have a structural assumption that these data points belong to some number of clusters. Our job in cluster analysis is to find these clusters and assign each data point to one.

For simplicity, we assume that the number of clusters, K, is known a priori. Our goal here is to test a number of hypotheses about the underlying structure, and each hypothesis is expressed as a set of assumed cluster centers, denoted as θ.

We use the notation X and µ(X) to denote a random data point and its cluster center, respectively. We assign to each X the closest cluster center in θ. In other words, we write µ(X) as:

\[ \mu(X) = \arg\min_{\mu \in \theta} \|X - \mu\|_2. \]

Therefore, we have the following definition for the error distribution:

\[ X = \mu(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Sigma). \tag{2.12} \]

Here, ε is the displacement between a data point and its cluster center, and it follows a multivariate Gaussian distribution centered at the origin 0, i.e., a zero vector, with covariance Σ, a d by d symmetric and positive-definite matrix.

Figure 2.2: Cluster analysis is a way to generalize data into groups.

The probability of observing some data point is written out somewhat differently because the support sets of density functions overlap with each other. In this definition, we introduce a normalization factor Z(θ):

\[ p(x|\theta) = \frac{1}{Z(\theta)} \mathcal{N}(x; \mu(x), \Sigma), \qquad Z(\theta) = \int_{x'} \mathcal{N}(x'; \mu(x'), \Sigma) \, dx'. \tag{2.13} \]

Following the idea in the previous section, we estimate the entropy of p(x|θ) empirically using the sample-based method. The entropy is written as:

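\[
\begin{aligned}
H(X|\theta) &= -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i|\theta) \\
&= \frac{1}{2n} \sum_{i=1}^{n} (x_i - \mu(x_i))^T \Sigma^{-1} (x_i - \mu(x_i)) + \frac{1}{2} \log |2\pi\Sigma| + \log Z(\theta).
\end{aligned}
\tag{2.14}
\]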

This definition of error entropy has two implications. First, the sum of squared errors, a penalty function commonly used in cluster analysis, is actually a special case of information preservation. This happens when the likelihood p(x|θ) is left unnormalized, i.e., when Z(θ) is set to some constant. Second, solving the information preservation problem in general with respect to this equation leads to a regularized version of the sum of squared errors.


Let us discuss the first case. When Z(θ) does not depend on θ, it can be shown that a trivial hypothesis θ₀ that assigns each data point to itself as the cluster center, i.e., µ(x_i) = x_i for all i = 1, . . . , n, is the global minimizer of this entropy. To see how, let us suppose Z(θ) = c for some constant c and apply the derivative tests with respect to µ:

\[ H(X|\theta) = \frac{1}{2n} \sum_{i=1}^{n} (x_i - \mu(x_i))^T \Sigma^{-1} (x_i - \mu(x_i)) + \frac{1}{2} \log |2\pi\Sigma| + \log c, \tag{2.15} \]

\[ \frac{\partial}{\partial \mu} H(X|\theta) = -\frac{1}{n} \sum_{i : \mu(x_i) = \mu} \Sigma^{-1} (x_i - \mu), \tag{2.16} \]

\[ \frac{\partial^2}{\partial \mu \, \partial \mu^T} H(X|\theta) = \frac{1}{n} \Sigma^{-1}. \tag{2.17} \]

We first establish the convexity of H(X|θ) by writing out its Hessian, which is positive definite. Since the first derivative vanishes at θ = θ0, we confirm that θ0 is a global minimum.

Applying the concept of information preservation, the best hypothesis θ* is actually a minimizer of the weighted sum of squared errors. This result accords with the usual assumption in cluster analysis, and suggests that information preservation is a more general method in this respect.

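\[
\begin{aligned}
\theta^* &= \arg\min_\theta \; |H(X|\theta) - H(X|\theta_0)| \\
&= \arg\min_\theta \; H(X|\theta) - H(X|\theta_0) \\
&= \arg\min_\theta \; H(X|\theta) \\
&= \arg\min_\theta \; \sum_{i=1}^{n} (x_i - \mu(x_i))^T \Sigma^{-1} (x_i - \mu(x_i)).
\end{aligned}
\tag{2.18}
\]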

Solving the general information preservation problem analytically is difficult, because Z(θ) depends on the positions of all the cluster centers, and this entanglement cannot be easily analyzed. Therefore, as a necessary compromise, we suggest solving this problem in the context of medoid-based clustering, meaning that cluster centers have to sit on some data points rather than at arbitrary positions. This leads to our second result in this example.

Generally, the exact value of Z(θ) can be written out as a summation of integrals. For any cluster center µ, let N(µ) denote the neighborhood of µ, which is the set of points that have µ as the cluster center, i.e., µ = arg min_{µ′∈θ} ‖x − µ′‖₂ for all x ∈ N(µ). With knowledge about N(µ), we can write Z(θ) as a summation of the densities contributed by individual cluster centers:

\[ Z(\theta) = \sum_{\mu \in \theta} \int_{x \in N(\mu)} \mathcal{N}(x; \mu, \Sigma) \, dx. \tag{2.19} \]

Solving this equation is still very challenging, since each integral has to be solved numerically with respect to N(µ), which is not easy to compute in high dimension. To approximate N(µ), we replace the domain of the innermost integral with a confidence region C_r(µ, Σ) of the Gaussian distribution with mean µ and covariance Σ, i.e., the set of points that lie within some radius r of the mean µ:

\[
\begin{aligned}
Z(\theta) &\approx \sum_{\mu \in \theta} \int_{x \in C_r(\mu, \Sigma)} \mathcal{N}(x; \mu, \Sigma) \, dx \\
&= \sum_{\mu \in \theta} \int_{x \in C_r(0, \Sigma)} \mathcal{N}(x; 0, \Sigma) \, dx.
\end{aligned}
\tag{2.20}
\]

We use the following equation to determine r for each µ. The idea is to take half of the distance from µ to its nearest neighboring cluster center.

\[
r = \frac{1}{2} \min_{\mu' \in \theta,\ \mu' \neq \mu} \|\mu - \mu'\|_2. \tag{2.21}
\]

It is now feasible to solve the innermost integral in Equation (2.20) numerically.
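To make the approximation concrete, here is a minimal sketch under two simplifying assumptions that are not made in the text: the covariance is isotropic (Σ = σ²I) and C_r is the Euclidean ball of radius r. Under these assumptions each integral reduces to a chi-square probability, which is evaluated below through the power series of the regularized lower incomplete gamma function. The names `ball_mass` and `approx_partition` are illustrative.

```python
import math


def ball_mass(r, sigma, d):
    """Mass of N(0, sigma^2 I_d) inside the Euclidean ball of radius r.

    This equals P(d/2, r^2 / (2 sigma^2)), the regularized lower
    incomplete gamma function, computed with its power series.
    """
    if r <= 0.0:
        return 0.0
    a, x = d / 2.0, r * r / (2.0 * sigma * sigma)
    if math.isinf(r) or x > 700.0:
        return 1.0  # assumes small d; the series would overflow here
    term = total = 1.0 / a
    k = 0
    while term > 1e-17 * total and k < 10000:
        k += 1
        term *= x / (a + k)
        total += term
    return min(1.0, math.exp(a * math.log(x) - x - math.lgamma(a)) * total)


def approx_partition(centers, sigma):
    """Approximate Z(theta) as in Equation (2.20), with r from Equation (2.21)."""
    d = len(centers[0])
    z = 0.0
    for i, mu in enumerate(centers):
        others = [math.dist(mu, nu) for j, nu in enumerate(centers) if j != i]
        r = 0.5 * min(others) if others else math.inf
        z += ball_mass(r, sigma, d)
    return z


print(approx_partition([(0.0, 0.0), (2.0, 0.0)], sigma=1.0))
```

For a general Σ the integral has no such closed form and would have to be computed numerically, as the text indicates.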

Considering only medoids as cluster centers, we can compute this approximation to Z(θ) much more efficiently for any given hypothesis. Nevertheless, further simplification is needed to efficiently explore the search space. Here, we propose using an iterative algorithm to find the solution. This algorithm starts from the trivial hypothesis that has n cluster centers and iteratively removes one cluster center at a time.

That is to say, it iteratively solves the following information preservation problem for i = 1, . . . , n − K.

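\begin{align}
\theta^{(0)} &= \theta_0, \tag{2.22} \\
\theta^{(i)} &= \arg\min_{\theta} \left| H(X|\theta) - H(X|\theta^{(i-1)}) \right| \tag{2.23} \\
&= \arg\min_{\theta} \left| \frac{1}{2n} \Delta_{\mathrm{WSSE}}(\theta, \theta^{(i-1)}) + \log \frac{Z(\theta)}{Z(\theta^{(i-1)})} \right|, \tag{2.24}
\end{align}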

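where ∆_WSSE(θ, θ⁽ⁱ⁻¹⁾) is the change in sum of squared errors,

\[
\Delta_{\mathrm{WSSE}}(\theta, \theta^{(i-1)}) = \sum_{x:\, \mu^{(i-1)}(x) \neq \mu(x)} \left[ (x - \mu(x))^T \Sigma^{-1} (x - \mu(x)) - (x - \mu^{(i-1)}(x))^T \Sigma^{-1} (x - \mu^{(i-1)}(x)) \right].
\]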

The decision criterion in Equation (2.24) is actually a regularized version of the weighted sum of squared errors. The term ∆_WSSE(θ, θ⁽ⁱ⁻¹⁾) is always positive because removing cluster centers increases errors. The regularization term, log Z(θ) − log Z(θ⁽ⁱ⁻¹⁾), is negative due to the reduced total densities. Practically, we often want to drop the absolute value, because the regularization term is lower bounded by − log 2.

This regularization method penalizes cases where the removed cluster center is close to the others. In other words, it favors more spread-out clusters. This property also aligns with a conventional heuristic that suggests maximizing inter-cluster distances.
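The iterative procedure of Equations (2.22) to (2.24) can be sketched as follows. This is only an illustration under strong assumptions that are not in the text: the data are one-dimensional and distinct, Σ = σ², the confidence region is the interval of half-width r, and the absolute value is dropped as suggested above. The function names are hypothetical.

```python
import math


def wsse(points, centers, sigma):
    """Weighted sum of squared errors in one dimension."""
    return sum(min((x - mu) ** 2 for mu in centers) for x in points) / sigma ** 2


def approx_z(centers, sigma):
    """Approximate Z(theta): Gaussian mass within half the nearest-center gap."""
    z = 0.0
    for mu in centers:
        gaps = [abs(mu - nu) for nu in centers if nu != mu]
        r = 0.5 * min(gaps) if gaps else math.inf
        z += math.erf(r / (sigma * math.sqrt(2.0)))
    return z


def greedy_medoids(points, k, sigma=1.0):
    """Start from all points as centers; remove one center per iteration (k >= 1)."""
    centers = sorted(set(points))  # trivial hypothesis theta_0
    n = len(points)
    while len(centers) > k:
        base_wsse = wsse(points, centers, sigma)
        base_z = approx_z(centers, sigma)

        def score(mu):
            rest = [c for c in centers if c != mu]
            delta = wsse(points, rest, sigma) - base_wsse
            # Equation (2.24) without the absolute value
            return delta / (2 * n) + math.log(approx_z(rest, sigma) / base_z)

        centers.remove(min(centers, key=score))
    return centers


data = [0.0, 0.2, 5.0, 5.3, 10.0]
print(greedy_medoids(data, k=3))
```

Each iteration rescans all points for every candidate removal, so the sketch favors readability over efficiency.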

2.5 Concluding Remarks

In the previous sections, we have briefly reviewed several classic problems in probabilistic modeling. Conventional approaches to these problems, such as error minimization and maximization of mutual information, are shown to be equivalent to information preservation. Moreover, information preservation leads to a new regularization method for data clustering. This discussion, though neither thorough nor rigorous, has suggested that information preservation is a more general and unified optimization strategy in probabilistic modeling.

So far, our exploration is limited to general modeling tasks. In the following chapters, we go on and uncover the potential of information preservation in a broader application domain, namely, natural language processing.


Chapter 3 Unsupervised Word Segmentation

3.1 Overview

Word segmentation is the computational task of identifying word boundaries in a continuous text stream (Goldwater et al., 2009), and it is essential to natural language processing because many NLP applications are designed to work at this level of abstraction. In language, words are the smallest lexical units that carry coherent semantics or meaning. That is to say, the meaning of individual words can be postulated or interpreted on their own. Therefore, for NLP applications that focus mostly on semantics, words are the ideal representation to operate on.

In this chapter, we focus on unsupervised word segmentation, which aims to solve the word segmentation problem without using any training data. This technique has been vigorously studied by cognitive scientists and regional linguists, and has thus become an intersection of theory and application.

In cognitive science, unsupervised word segmentation is often related to lexical acquisition, the study of how infants acquire language at an early age. It is generally believed that infants learn word boundaries largely without supervision, even though a number of visual or gestural cues have been shown to be useful in this process, so knowledge about practical methods is of great value to cognitivists in developing their theories.

Regional language groups also express strong interest in unsupervised word segmentation, since in certain Asian languages, such as Japanese and Chinese, recognition of words is a non-trivial task. These notable exceptions do not practice the “whitespace separation” convention, which is the norm in many other languages such as English, and therefore the words of an utterance in these languages are usually written out without separation, making further processing even more difficult. Advanced supervised approaches using conditional random fields have achieved promising results in many related regional tasks, but their adoption is generally limited by the costly and labor-intensive process of preparing training data. Unsupervised methods, on the contrary, do not have this burden and remain an economical alternative for most NLP projects.

In the following sections, we introduce a compression-based framework for unsupervised word segmentation, assuming that true words in a text can be uncovered via text compression. Here, we point out that efficiency, as usually assumed in data compression, is not the only factor to consider. Vocabulary complexity, which is the entropy rate of the resulting lexicon, needs to be controlled to some extent. As its complexity increases, more effort is required to harness the vocabulary. To address this issue, we suggest applying the principle of information preservation. We create a new formulation of word segmentation and write it out as an optimization problem, which is called regularized compression.

The rest of this chapter is structured as follows. We briefly summarize related work on unsupervised word segmentation in Section 3.2. In Sections 3.3 and 3.4, we introduce the proposed formulation. The iterative algorithm and other technical details for solving the optimization problem are covered in Sections 3.5 and 3.6. In Section 3.7, we describe the evaluation procedure and discuss the experimental results. Finally, we present concluding remarks in Section 3.8.

3.2 Related Work

There is a rich body of literature on unsupervised word segmentation. Much early work focused on boundary detection, using cohesion measures to assess the association strength between tokens. Further development emphasized language generation, casting word segmentation as a model inference problem; recently, this approach has gained success as more sophisticated nonparametric Bayesian methods were introduced to model the intricate language generation process. The latest research trend is to use the minimum description length principle to optimize the performance of existing methods. In the following subsections, we review a number of popular methods in this area and briefly discuss their strengths and weaknesses.

3.2.1 Transitional Probability

In linguistics, one of the earliest and most influential ideas for boundary detection is the transitional probability. This idea was introduced in 1955 by Harris (Harris, 1955), who exploited the connection between token uncertainty and the presence of a boundary. In his seminal work, he stated that the uncertainty of tokens coming after a sequence helps determine whether a given position is at a boundary. Since then, many researchers have taken this idea and given their own formulations of the token uncertainty at a boundary.

One reasonable way to model the token uncertainty is through entropy. Starting from the late 1990s, many research efforts in Chinese text analysis have followed this trail (Chang and Su, 1997; Huang and Powers, 2003). The definition of branching entropy that we know of today was given in 2006 by Jin and Tanaka-Ishii (Tanaka-Ishii, 2005; Jin and Ishii, 2006). Formally, the branching entropy of a word w is defined as:

\[
H(X | X_n = w) = -\sum_{x \in \chi} P(x|w) \log P(x|w),
\]

where χ is the set of all the possible characters. Note that this formulation is different from the conditional entropy of X given Xn.
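As an illustration, the branching entropy can be estimated from raw text by counting the characters that follow each occurrence of w. This is a minimal sketch, assuming maximum-likelihood estimates of P(x|w) and base-2 logarithms (the text does not fix the base); the function name is illustrative.

```python
import math
from collections import Counter


def branching_entropy(text, w):
    """Entropy (in bits) of the character that follows occurrences of w in text."""
    followers = Counter(
        text[i + len(w)]
        for i in range(len(text) - len(w))
        if text.startswith(w, i)
    )
    total = sum(followers.values())
    return -sum(c / total * math.log2(c / total) for c in followers.values())


print(branching_entropy("abcabd", "ab"))  # followers c and d, equally likely
print(branching_entropy("abcabc", "ab"))  # the follower is always c
```

A high value after w suggests that many different characters can follow it, which is the signal used to place a boundary.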

There are other ways to model token uncertainty. In 2004, Feng et al. proposed the accessor variety, which estimates the uncertainty based on frequency rather than probability (Feng et al., 2004). In their definition, the accessor variety of w is written as:

\[
\min\{AV_L(w), AV_R(w)\},
\]

where the left and right accessor varieties AV_L and AV_R are defined as the number of distinct tokens that precede and follow w, respectively.
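The definition translates into a few lines of code. The sketch below treats the input as a single string and simply ignores occurrences at the two ends of the string when collecting neighbors; this boundary handling is an assumption of the sketch rather than part of the definition given here.

```python
def accessor_variety(text, w):
    """min(AV_L, AV_R): distinct characters preceding and following w in text."""
    left, right = set(), set()
    n = len(w)
    for i in range(len(text) - n + 1):
        if text.startswith(w, i):
            if i > 0:
                left.add(text[i - 1])
            if i + n < len(text):
                right.add(text[i + n])
    return min(len(left), len(right))


print(accessor_variety("xabyabz", "ab"))   # left {x, y}, right {y, z} -> 2
print(accessor_variety("xabyxabz", "ab"))  # left {x}, right {y, z} -> 1
```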

3.2.2 Mutual Information

In 1990, Sproat and Shih discovered the use of mutual information in detecting two-character groups (i.e., bigrams) in Chinese text (Sproat and Shih, 1990). Their work amplified the idea of using an association measure to assess how likely a string is to be a word. This idea is straightforward: the segmentor first determines the most probable bigram in an input phrase by using mutual information, places two boundaries around the discovered bigram, and recursively processes the remaining sub-phrases.

To date, there have been many association measures based on mutual information (Chien, 1997; Sun et al., 1998) introduced to the word segmentation problem.
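The recursive procedure described above can be sketched as follows. This is an illustration in the spirit of that description, not a reproduction of the original system: the pointwise mutual information estimate, the `threshold` parameter, and the function names are assumptions of the sketch.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Pointwise mutual information (bits) for every adjacent character pair."""
    uni = Counter(corpus)
    bi = Counter(zip(corpus, corpus[1:]))
    n_uni, n_bi = len(corpus), len(corpus) - 1
    return {
        (x, y): math.log2((c / n_bi) / ((uni[x] / n_uni) * (uni[y] / n_uni)))
        for (x, y), c in bi.items()
    }


def segment(phrase, pmi, threshold=0.0):
    """Bracket the highest-scoring bigram, then recurse on both remainders."""
    if len(phrase) < 2:
        return [phrase] if phrase else []
    scores = [pmi.get((phrase[i], phrase[i + 1]), -math.inf)
              for i in range(len(phrase) - 1)]
    i = max(range(len(scores)), key=scores.__getitem__)
    if scores[i] <= threshold:  # hypothetical cut-off: no bigram is strong enough
        return list(phrase)
    return (segment(phrase[:i], pmi, threshold)
            + [phrase[i:i + 2]]
            + segment(phrase[i + 2:], pmi, threshold))


table = pmi_table("abxyabzwab")
print(segment("ab", table))
print(segment("xabz", table))
```

By construction the output contains only single characters and bigrams, which reflects the two-character focus of the method.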


3.2.3 Hierarchical Bayesian Methods

The past few years have seen many nonparametric Bayesian methods developed to model natural languages. Many of these methods were applied to word segmentation and have collectively reshaped the entire research field. In cognitive science, unsupervised word segmentation is often related to language acquisition, an active area in which researchers explore how infants acquire spoken languages in an unsupervised fashion in their very early years. This connection was first established in behavioral studies in the 1950s.

The perspective that cognitivists take toward unsupervised word segmentation is usually a generative one, in which an underlying probabilistic structure is assumed to govern the generation of word sequences. To uncover the boundaries, one learns the latent model parameters from the text stream by applying inference techniques based on the maximum-likelihood (ML) or maximum a-posteriori (MAP) principle, and then the most likely segmentation boundaries are discovered by using Viterbi-like algorithms.

Hierarchical Bayesian methods were first introduced to complement conventional probabilistic methods to facilitate context-aware word generation. Goldwater et al. (2006) used hierarchical Dirichlet processes (HDP) to induce contextual word models. Specifically, they expressed the contextual dependencies as:

\[
\Pr(w_1 \ldots w_n \$) = P(w_1 | \$) \left[ \prod_{i=2}^{n} P(w_i | w_{i-1}) \right] P(\$ | w_n).
\]
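As a worked example of this factorization, the sketch below evaluates the probability of an utterance from a table of bigram probabilities, working in log space to avoid underflow on long utterances. The table layout, with `bigram[(prev, cur)]` standing for P(cur | prev), and the toy probabilities are assumptions made for illustration.

```python
import math


def utterance_log_prob(words, bigram, boundary="$"):
    """log Pr(w_1 ... w_n $) under a bigram model with boundary symbol $."""
    padded = [boundary] + list(words) + [boundary]
    return sum(math.log(bigram[(prev, cur)])
               for prev, cur in zip(padded, padded[1:]))


# Toy probabilities, chosen only for illustration.
bigram = {("$", "a"): 0.5, ("a", "b"): 0.4, ("b", "$"): 0.25}
print(math.exp(utterance_log_prob(["a", "b"], bigram)))  # approximately 0.05
```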

The underlying generative process assumed by HDP is as follows. Note that P0 is uniform across different word lengths.

\[
\begin{aligned}
G &\sim \mathrm{DP}(\alpha_0, P_0), \\
W_i \mid W_{i-1} = l &\sim \mathrm{DP}(\alpha_1, G) \quad \forall l.
\end{aligned}
\]

This approach was a significant improvement over conventional probabilistic methods, and has inspired further explorations into more advanced hierarchical modeling techniques. Examples include the nested Pitman-Yor process (Mochihashi et al., 2009), a sophisticated scheme for hierarchical modeling at both the word and character levels, and adaptor grammars (Johnson and Goldwater, 2009), a framework that generalizes the idea behind HDP to probabilistic context-free grammars.

The generative description of the nested Pitman-Yor Process (NPYP) is summarized as follows. Note that P0 is generated from yet another character-level HPYP.

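\[
\begin{aligned}
G &\sim \mathrm{PY}(d, \theta, P_0), \\
G_l &\sim \mathrm{PY}(d, \theta, G) && \forall l, \\
G_{l,k} &\sim \mathrm{PY}(d, \theta, G_l) && \forall l, k, \\
W_i \mid W_{i-1} = l &\sim G_l, \\
W_i \mid W_{i-1} = l, W_{i-2} = k &\sim G_{l,k}.
\end{aligned}
\]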

For a detailed description of adaptor grammars, we refer the reader to Johnson et al. (2009).

3.2.4 Minimum Description Length Principle

The minimum description length (MDL) principle, originally developed in the context of information theory, was adopted in Bayesian statistics as a principled model selection method (Rissanen, 1978). Its connection to lexical acquisition was first uncovered in behavioral studies, and early applications focused mostly on applying MDL to induce word segmentation that results in compact lexicons (Kit and Wilks, 1999; Yu, 2000; Argamon et al., 2004).

Kit and Wilks proposed a compression-based approach, called description length gain, in 1999. In their formulation (Zhao and Kit, 2008), the description length gain is defined as:

\[
L(C) - L(C[r \to w] \oplus r),
\]

where C[r → w] is the resulting string after replacing all occurrences of w in C with r. Note that ⊕ denotes the concatenation operator, and the description length L(C) is obtained from the Shannon-Fano code of C. Specifically, L(C) = |C| × H(V(C)), where V(C) denotes the character vocabulary of C.

Table 3.1: The notation used in the development of regularized compression.

  Symbol                             Meaning
  C = ⟨c_1, ..., c_N⟩                Character sequence
  W = ⟨w_1, ..., w_M⟩                Word sequence, M < N
  U = {u_0 = 0, u_1, ..., u_K}       Positions of utterance boundaries
  A_C                                Character alphabet
  A_W                                Word alphabet (i.e., lexicon)
  f_⟨c_i, ..., c_j⟩                  n-gram frequency

More recent approaches (Zhikov et al., 2010; Hewlett and Cohen, 2011) used MDL in combination with existing algorithms, such as branching entropy (Tanaka-Ishii, 2005; Jin and Ishii, 2006) and bootstrap voting experts (Hewlett and Cohen, 2009), to determine the best segmentation parameters. On various benchmarks, MDL-powered algorithms have achieved state-of-the-art performance, sometimes even surpassing that of the most sophisticated hierarchical modeling methods.

3.3 Preliminaries

In this section, we describe the notation and the basic assumptions in our work. In developing the notation, we try to be consistent with mathematical convention, but we find that, in the procedure that we are about to motivate, certain mathematical aspects (e.g., sequence and set) of one underlying object cannot be easily expressed without clutter. For clarity, we intentionally abuse the notation in some cases, using the same expression to represent different mathematical viewpoints when the meaning is clear. A short summary of the notation is given in Table 3.1.

The input to our word segmentation algorithm is a set of unsegmented texts, each of which represents an utterance. Suppose that this set has K elements and that the utterances consist of N characters in total. For brevity, we denote the input as a sequence of characters C = ⟨c_1, ..., c_N⟩, as if conceptually concatenating all the utterances into one string. Analogously, a segmented output is defined as a sequence of words W = ⟨w_1, ..., w_M⟩, for some M ≤ N.

We represent the positions of all the utterance boundaries in C as a sequence of integers U = ⟨u_0 = 0, u_1, ..., u_K = N⟩, where u_0 < u_1 < ... < u_K. Therefore, to find the k-th utterance in C, we look at the characters between positions u_{k−1} + 1 and u_k. In other words, the k-th utterance is the subsequence ⟨c_{u_{k−1}+1}, ..., c_{u_k}⟩.

A valid word segmentation W of C has to satisfy two properties. First, W represents the same piece of text as C does. Second, W respects the utterance boundaries U, meaning that no word w_i ∈ W spans two utterances. In mathematical terms, there exists a sequence of boundaries B = ⟨b_0 = 0, b_1, ..., b_M = N⟩ such that:

(i) b_0 < b_1 < ... < b_M, (ii) c_{b_{i−1}+1} ... c_{b_i} = w_i for 1 ≤ i ≤ M, and (iii) U ⊂ B.
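The three conditions can be checked mechanically. The following is a minimal sketch that recovers W from C and B and raises an error when B is not a valid boundary sequence; the function name and the error messages are illustrative.

```python
def segment_by_boundaries(chars, boundaries, utterance_bounds):
    """Return the words defined by B, enforcing conditions (i) to (iii)."""
    b = list(boundaries)
    if not b or b[0] != 0 or b[-1] != len(chars):
        raise ValueError("B must start at 0 and end at N")
    if any(x >= y for x, y in zip(b, b[1:])):
        raise ValueError("condition (i): boundaries must strictly increase")
    if not set(utterance_bounds) <= set(b):
        raise ValueError("condition (iii): every utterance boundary must be in B")
    # Condition (ii): the 1-indexed span c_{b_{i-1}+1} ... c_{b_i}
    # is the 0-indexed slice chars[b_{i-1}:b_i].
    return [chars[b[i - 1]:b[i]] for i in range(1, len(b))]


# Two utterances, "abc" and "de", concatenated into C = "abcde".
print(segment_by_boundaries("abcde", [0, 1, 3, 5], [0, 3, 5]))  # ['a', 'bc', 'de']
```

A boundary sequence such as [0, 2, 4, 5] is rejected for the same input because the word "cd" would span both utterances.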

The unique elements in a sequence implicitly define an alphabet set, or lexicon. This property is not limited to character or word sequences. Hereafter, for any sequence S, we denote its alphabet as A_S. As we shall find out later, sometimes we need to refer to a sequence that is either a mix of characters and words, or of a single type that we do not care to specify. We use the generic term token sequence for this type of sequence.

3.4 Regularized Compression

Word segmentation results from compressing a sequence of characters. By compression, we refer to a series of steps that we can iteratively apply to the character sequence C to produce the final result W. In each of these steps, called compression steps, we replace some subsequence ⟨c_i, c_{i+1}, ..., c_j⟩ in C with a string w = c_i c_{i+1} ... c_j (a word). It is clear that, for every possible W, such compression steps always exist.


Since the subsequence ⟨c_i, c_{i+1}, ..., c_j⟩ may occur multiple times in C, for simplicity, we further assume that, in each step, all the occurrences of the subsequence ⟨c_i, c_{i+1}, ..., c_j⟩ are compressed at once. That is to say, we replace every subsequence in the set

\[
\{ \langle c_k, \ldots, c_l \rangle \mid c_k = c_i, c_{k+1} = c_{i+1}, \ldots, c_l = c_j \}
\]

with the string w. Therefore, the compression step can be denoted as a rewrite rule, as in:

\[
w \to \langle c_i, c_{i+1}, \ldots, c_j \rangle.
\]

This assumption greatly simplifies the modeling, since we do not need to worry about the compression order of these occurrences, although it is no longer clear whether going from any C to any W using compression is feasible. In the following discussion, we stick with this approximation and show that it does not lead to disastrous results.

Relaxing this constraint is beyond the scope of this work and remains an open issue.
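A single compression step under this assumption can be sketched as follows. The sketch replaces occurrences greedily from left to right without overlap, which is one possible reading of "all the occurrences at once" when the subsequence overlaps with itself; this choice and the function name are assumptions of the example.

```python
def apply_rule(tokens, pattern):
    """Replace every left-to-right, non-overlapping occurrence of `pattern`
    in `tokens` with the concatenated token; return the result and the count."""
    pattern = list(pattern)
    m = len(pattern)
    if m < 2:
        raise ValueError("a compression step must merge at least two tokens")
    out, i, count = [], 0, 0
    while i < len(tokens):
        if tokens[i:i + m] == pattern:
            out.append("".join(pattern))
            i += m
            count += 1
        else:
            out.append(tokens[i])
            i += 1
    return out, count


tokens = list("abcabcab")
compressed, occurrences = apply_rule(tokens, ["a", "b"])
print(compressed)   # ['ab', 'c', 'ab', 'c', 'ab']
print(occurrences)  # 3
```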

Here, we first review some basic properties of a compression step. Generally, applying a compression step to C has the following effects:

1. The total number of tokens in C decreases, and its alphabet set A_C is expanded to include a new token w = c_i ... c_j. Specifically, the size of C, denoted as |C|, is reduced by f_⟨c_i, ..., c_j⟩ (j − i). When a compression step reduces more tokens, this step is said to be more coherent. It means that the subsequence to be replaced appears more frequently in the text, and thus is more likely to be a collocation;

2. Asymptotically, the vocabulary complexity, measured in terms of entropy rate, always increases. We can estimate the entropy rate empirically by seeing a token sequence as a series of outcomes drawn from some underlying stochastic process. The absolute change in entropy rate is called deviation.


The first property seems like a natural consequence. It says that the effort to describe the same piece of information, in terms of the number of tokens, is reduced at the expense of expanding the vocabulary. The second property is less obvious: it means that compression has a side effect of redistributing probability mass among observations, thereby causing a deviation in entropy rate. Here, we describe a stronger result that the deviation is always positive, i.e., entropy always increases. A formal argument for this is given in Appendix A.1.

Since every compression step involves a choice of subsequence to be replaced, we can say that word segmentation is basically a combinatorial optimization problem.

Understanding these effects gives us some insight into making reasonable choices. In the following definitions, we seek to quantify the effects of cohesion and deviation.

Specifically, they are defined with respect to two token sequences S1 and S2, such that S2 is the result of applying some series of compression steps to S1:

\begin{align}
\mathrm{cohesion} &\equiv -|S_2| / |S_1|, \tag{3.1} \\
\mathrm{deviation} &\equiv |\tilde{H}(S_2) - \tilde{H}(S_1)|. \tag{3.2}
\end{align}

Note that, for some token sequence S, ˜H(S) denotes the empirical entropy rate for the random variable S ∈ A_S, in which the probability p(S) is estimated using the relative frequencies in sequence S.
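Both quantities are straightforward to compute from two token sequences. The sketch below follows Equations (3.1) and (3.2), with the empirical entropy estimated from relative token frequencies; the use of base-2 logarithms and the function names are assumptions.

```python
import math
from collections import Counter


def empirical_entropy(tokens):
    """Entropy (bits) of the relative-frequency distribution over tokens."""
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in Counter(tokens).values())


def cohesion(s1, s2):
    """Equation (3.1): -|S2| / |S1|."""
    return -len(s2) / len(s1)


def deviation(s1, s2):
    """Equation (3.2): absolute change in empirical entropy."""
    return abs(empirical_entropy(s2) - empirical_entropy(s1))


s1 = list("abab")   # before compression
s2 = ["ab", "ab"]   # after merging every (a, b)
print(cohesion(s1, s2), deviation(s1, s2))
```

On a sequence this small the numbers only illustrate the arithmetic; the asymptotic behavior discussed above concerns much longer sequences.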

We suggest that each choice shall be optimized toward high cohesion and low deviation. This is easily justified, because high cohesion leads to choices that cover more collocations, and low deviation translates to less effort for language users to harness the new vocabulary. Conceptually, to go from sequence S to some new sequence S^′ using only k compression steps, we consider the following procedure:

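\[
\begin{aligned}
\text{minimize} \quad & (-\mathrm{cohesion}(S, S'),\ \mathrm{deviation}(S, S')) \\
\text{subject to} \quad & S' \text{ respects } U, \\
& r_1, \ldots, r_k \text{ are valid steps to go from } S \text{ to } S'.
\end{aligned}
\tag{3.3}
\]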


Note that the objective is in vector form, and therefore it has to be scalarized by using a convex combination of its components. Specifically, by valid compression steps, we mean that r_1, ..., r_k satisfy:

\[
S^{(0)} = S, \quad S^{(k)} = S', \quad S^{(i-1)} \xrightarrow{r_i} S^{(i)} \ \text{ for } i = 1, \ldots, k.
\]

Using this procedure, we are able to find the best sequence, in terms of the aforementioned criteria, that is reachable in k compression steps from S. Nevertheless, this formulation is not very useful in practice, because in word segmentation, we want to find the best sequence within some predefined compression rate ρ. Compression rate is actually the average word length of the sequence, and we expect it to match the average word length of the language.

Rewriting the procedure to reflect this need can nevertheless trivialize our first decision criterion: when we impose a constraint on the compression rate, the cohesion will always be the same. Beyond this issue, searching the entire space remains challenging because there are exponentially many valid choices to consider. These issues are addressed in the next section using an iterative approximation method.

3.5 Iterative Approximation

Acknowledging that exponentially many feasible sequences need to be checked, we propose an alternative formulation in a restricted solution space. The idea is, instead of optimizing for segmentations, we search for segmentation generators, i.e., a set of functions that generate segmentations from the input. The generators we consider here are ordered rulesets.

An ordered ruleset R = ⟨r_1, r_2, ..., r_k⟩ is a sequence of translation rules, i.e., compression steps. Each of these rules takes the following form:

\[
w \to \langle c_1, c_2, \ldots, c_n \rangle,
\]

where the right-hand side denotes some n-token subsequence to be replaced, and the left-hand side denotes the new token to be introduced. Applying an ordered ruleset R to a token sequence is equivalent to iteratively applying the translation rules r1, r2, . . . , rk in strict order.

This notion of ordered rulesets allows one to explore the search space efficiently using a greedy inclusion algorithm. The idea is to maintain a globally best ruleset B that covers the best translation rules we have discovered so far, and then iteratively expand B by discovering a new best rule and adding it to the ruleset.

The following procedure repeats several times until the compression rate reaches some predefined value ρ. In each iteration, the best translation rule is determined by solving a modified version of Equation (3.3), which is written as follows:

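(In iteration i)

\[
\begin{aligned}
\text{minimize} \quad & -\alpha \times \mathrm{cohesion}(S^{(i-1)}, S^{(i)}) + \mathrm{deviation}(S^{(i-1)}, S^{(i)}) \\
\text{subject to} \quad & S^{(i)} \text{ respects } U, \\
& r \text{ is a valid compression step.}
\end{aligned}
\tag{3.4}
\]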

Note that the alternative formulation is largely a greedy version of Equation (3.3).

The vector-form objective is scalarized by using a trade-off parameter α. A brief sketch of the algorithm is given in Algorithm 1.

3.6 Implementation

Additional care needs to be taken in implementation. The simplest way to collect n-gram counts for computing the objective in Equation (3.4) is to run multiple scans over the entire sequence. Our experience suggests that using an indexing structure that keeps track of token positions can be more efficient. This is especially important when updating the affected n-gram counts in each iteration. Since replacing one occurrence of any subsequence affects only its surrounding n-grams, the total number of affected n-gram occurrences in one iteration is linear in the number of occurrences of the replaced subsequence. Using an indexing structure in this case has the advantage of reducing seek time.

Algorithm 1 The proposed word segmentation algorithm
Require: ρ
  S ← C
  for all t = 1, 2, . . . do
    if compression rate is less than or equal to ρ then
      Leave the loop
    end if
    Solve Equation (3.4) to find the step r
    Rewrite S in-place using r
  end for
  return S

A detailed description of the implementation is given in Appendix A.2. Note, however, that the overall running time remains in the same complexity class regardless of the deployment of an indexing structure. The time complexity for this algorithm is O(T N), where T is the number of iterations and N is the length of the input sequence.

Although it is theoretically appealing to create an n-gram search algorithm, in this preliminary study we used a simple bigram-based implementation for efficiency. We considered only bigrams in creating translation rules, expecting that the discovered bigrams can grow into trigrams or higher-order n-grams in the subsequent iterations.
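The greedy loop with bigram rules can be sketched as follows. This is a naive illustration that rescans the whole sequence for every candidate rule, not the indexed implementation referred to above. It assumes that the input is a list of utterance strings, that the compression rate is the number of tokens per character (so that it decreases toward ρ), and that logarithms are taken in base 2; it also omits any additional restriction on which tokens may participate in a rule. All names are illustrative.

```python
import math
from collections import Counter


def entropy(tokens):
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in Counter(tokens).values())


def merge_pair(utterance, pair):
    """Merge every non-overlapping occurrence of `pair` within one utterance."""
    out, i = [], 0
    while i < len(utterance):
        if i + 1 < len(utterance) and (utterance[i], utterance[i + 1]) == pair:
            out.append(utterance[i] + utterance[i + 1])
            i += 2
        else:
            out.append(utterance[i])
            i += 1
    return out


def regularized_compression(utterances, rho, alpha):
    seqs = [list(u) for u in utterances]  # merging per utterance respects U
    n_chars = sum(len(u) for u in seqs)
    while True:
        flat = [t for u in seqs for t in u]
        if len(flat) / n_chars <= rho:  # assumed definition of compression rate
            break
        candidates = {p for u in seqs for p in zip(u, u[1:])}
        if not candidates:
            break
        h_prev, best = entropy(flat), None
        for pair in sorted(candidates):
            trial = [merge_pair(u, pair) for u in seqs]
            flat_trial = [t for u in trial for t in u]
            cohesion = -len(flat_trial) / len(flat)
            deviation = abs(entropy(flat_trial) - h_prev)
            score = -alpha * cohesion + deviation  # objective of Equation (3.4)
            if best is None or score < best[0]:
                best = (score, trial)
        seqs = best[1]
    return seqs


utterances = ["abab", "abc"]
print(regularized_compression(utterances, rho=0.6, alpha=1.0))
```

Each iteration removes at least one token, so the loop terminates either when the target rate is reached or when every utterance has collapsed into a single token.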

To allow unmerged tokens (i.e., characters that were supposed to be in one n-gram but were eventually left out due to the bigram implementation) to be merged into the discovered bigram, we also required that one of the two participating tokens on the right-hand side of any translation rule be an unmerged token. This has a side