Effective Distributions - On Sensible Linkage, Effective Distributions, and Model Pruning in Estimation of Distribution Algorithms

The idea of sensible linkage maps closely onto another notion, which we call effective distributions. By effective distributions, we mean distributions whose sampling reliably advances the solution quality. Thus, the essential conditions for effective distributions are

• the consistency with building blocks, and

• the provision of good directions for further search.

If effective distributions can be extracted from the built probabilistic model, we can perform partial sampling using only the effective distributions and leave the remaining parts of the solutions unchanged. Thus, diversity is maintained and we are free from the building block disruption and random drifting problems. For instance, returning to the earlier 16-bit optimization problem, if it is possible to identify the partial models that are built on sensible linkage, such as [1 2 3 4] in the first generation and [5 6 7 8] in the second generation (see the third column of Table 3.1), we can sample only the corresponding marginal distributions, which are, in this case, effective. That is, in the first generation, for each solution string, we re-sample only s₁s₂s₃s₄ according to the marginal distribution and keep s₅s₆···s₁₆ unchanged. In the second generation, we re-sample only s₅s₆s₇s₈ according to the marginal distribution and keep s₉s₁₀···s₁₆ at their current values (s₁s₂s₃s₄ have already converged). In this way, we do not have to resort to increasing the population size to deal with the problem caused by disparate building block scalings.
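The partial sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resample_block`, the list-of-bits representation, and the use of the selected solutions as the empirical marginal distribution are hypothetical choices, not the thesis implementation.

```python
import random

def resample_block(population, selected, block, rng=random):
    """Re-sample only the positions in `block`; every other bit is kept.

    The marginal distribution of the block is taken to be the empirical
    distribution of the block's partial solutions in `selected`
    (an assumption made for this sketch).
    """
    # Pool of observed partial solutions; drawing uniformly from the pool
    # samples each pattern with probability equal to its frequency.
    pool = [tuple(s[i] for i in block) for s in selected]
    new_population = []
    for s in population:
        child = list(s)
        drawn = rng.choice(pool)
        for pos, bit in zip(block, drawn):
            child[pos] = bit
        new_population.append(child)
    return new_population
```

For the 16-bit example, the first generation would call `resample_block(population, selected, [0, 1, 2, 3])` (0-based positions of s₁ to s₄), which leaves s₅ to s₁₆ exactly as they were.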

The above thoughts leave us one complication: the identification of effective distributions. Direct identification of effective distributions may be difficult, if not impossible. Thus, it may be wise to adopt a complementary approach: to identify those distributions that are not likely to be effective. If there is a way to identify the ineffective distributions, we can bypass them and sample only the remaining distributions, and thus approximate the result of knowing the effective distributions. Our idea is that if we split the entire population into two sub-populations and use only one sub-population for building the probabilistic model, we can use the second sub-population to collect statistics that may indicate the ineffectiveness of certain partial distributions in the probabilistic model built on the first sub-population. That is, with an appropriate heuristic or criterion, we can prune the likely ineffective portions of the model.

In the next chapter, our implementation of the proposed concept in ECGA will be detailed. More specifically, a judging criterion will be proposed to detect the likely ineffective marginal distributions of a given marginal product model.

Chapter 4 ECGA with Model Pruning

This chapter starts by reviewing the extended compact genetic algorithm. Based on the idea of detecting the inconsistency of statistics gathered from the two sub-populations, a mechanism is devised to identify the possibly ineffective parts of the built probabilistic model. Finally, an optimization algorithm incorporating the proposed technique is described in detail.

4.1 Extended Compact Genetic Algorithm

The extended compact genetic algorithm (ECGA) [6] uses a product of marginal distributions on a partition of the variables. This kind of probability distribution belongs to a class of probabilistic models known as marginal product models (MPMs). In this kind of model, subsets of variables can be modeled jointly, and each subset is considered independent of other subsets. In this work, we adopt the conventional notation in which variable subsets are enclosed in brackets. Table 4.1 presents an example of an MPM defined over four variables: s₁, s₂, s₃ and s₄. In this example, s₂ and s₄ are modeled jointly and each of the three variable subsets ([s₁], [s₂ s₄] and [s₃]) is considered independent of other subsets. For instance, the probability that this MPM generates a sample s₁s₂s₃s₄ = 0101 is calculated as follows,

P (s₁s₂s₃s₄ = 0101) = P (s₁ = 0) × P (s₂ = 1, s₄ = 1) × P (s₃ = 0)

= 0.4 × 0.4 × 0.5.

In fact, as its name suggests, a marginal product model represents a distribution that is the product of the marginal distributions over its variable subsets.

[s₁]              [s₂ s₄]                     [s₃]
P(s₁ = 0) = 0.4   P(s₂ = 0, s₄ = 0) = 0.4     P(s₃ = 0) = 0.5
P(s₁ = 1) = 0.6   P(s₂ = 0, s₄ = 1) = 0.1     P(s₃ = 1) = 0.5
                  P(s₂ = 1, s₄ = 0) = 0.1
                  P(s₂ = 1, s₄ = 1) = 0.4

Table 4.1: An example of marginal product model that defines a joint distribution over four variables. The variables enclosed in the same brackets are considered dependent and modeled jointly. Each variable subset is considered independent of other variable subsets.
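To make the example concrete, the MPM of Table 4.1 can be written down directly and used both to evaluate the probability of a string and to draw samples. The sketch below is illustrative only; the data layout (a list of index tuples paired with dictionaries) and the names `MPM`, `mpm_probability`, and `mpm_sample` are assumptions, not part of ECGA itself.

```python
import random

# Table 4.1 as (variable indices, marginal distribution) pairs.
# Indices are 0-based, so s1..s4 map to positions 0..3.
MPM = [
    ((0,),   {(0,): 0.4, (1,): 0.6}),
    ((1, 3), {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}),
    ((2,),   {(0,): 0.5, (1,): 0.5}),
]

def mpm_probability(model, s):
    """Probability of string s: the product of the subset marginals."""
    prob = 1.0
    for subset, dist in model:
        prob *= dist[tuple(s[i] for i in subset)]
    return prob

def mpm_sample(model, length, rng=random):
    """Draw one string by sampling each variable subset independently."""
    s = [0] * length
    for subset, dist in model:
        patterns = list(dist)
        weights = [dist[p] for p in patterns]
        chosen = rng.choices(patterns, weights=weights, k=1)[0]
        for pos, bit in zip(subset, chosen):
            s[pos] = bit
    return s

# Probability of 0101: 0.4 * 0.4 * 0.5, i.e. about 0.08 up to rounding.
print(mpm_probability(MPM, [0, 1, 0, 1]))
```

Because the subsets are independent, sampling never has to enumerate all 2⁴ strings; each block is drawn from its own small table.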

In ECGA, both the structure and the parameters of the model are searched and optimized with a greedy approach to fit the statistics of the selected set of promising solutions.

The quality of an MPM is quantified based on the minimum description length (MDL) principle [33], which assumes that, all things being equal, simpler distributions are better than complex ones. The MDL principle thus penalizes both inaccurate and complex models, thereby leading to a near-optimal distribution. Specifically, the search measure is the MPM complexity, which is quantified as the sum of the model complexity, C_m, and the compressed population complexity, C_p. The greedy MPM search first considers all variables as independent, so that each of them forms a separate variable subset. In each iteration, the greedy search merges the two variable subsets whose merger yields the largest reduction in C_m + C_p. The process continues until no further merge can decrease the combined complexity.

The model complexity, C_m, quantifies the model representation in terms of the number of bits required to store all the marginal distributions. Suppose that the given problem is of length ℓ with binary encoding, and the variables are partitioned into m subsets, the ith of size k_i, i = 1 . . . m, such that ℓ = Σ_{i=1}^{m} k_i. Then the marginal distribution corresponding to the ith variable subset requires 2^{k_i} − 1 frequency counts to be completely specified. Taking into account that each frequency count is of length log₂(n + 1) bits, where n is the population size, the model complexity, C_m, can be defined as

$$C_m = \log_2(n + 1) \sum_{i=1}^{m} \left( 2^{k_i} - 1 \right).$$
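As a worked illustration of the model complexity, consider the partition of Table 4.1, whose subset sizes are 1, 2 and 1. The population size n = 127 used below is a hypothetical value chosen only so that log₂(n + 1) is an integer; it does not come from the experiments.

```latex
\begin{align*}
C_m\bigl([s_1][s_2\,s_4][s_3]\bigr)
  &= \log_2(n+1)\,\bigl[(2^1-1) + (2^2-1) + (2^1-1)\bigr]
   = 5\log_2(n+1), \\
C_m\bigl([s_1\,s_2\,s_3\,s_4]\bigr)
  &= \log_2(n+1)\,(2^4-1) = 15\log_2(n+1). \\
\intertext{With the assumed $n = 127$, $\log_2(n+1) = 7$, so}
C_m\bigl([s_1][s_2\,s_4][s_3]\bigr) &= 35 \text{ bits}, \qquad
C_m\bigl([s_1\,s_2\,s_3\,s_4]\bigr) = 105 \text{ bits}.
\end{align*}
```

The comparison shows why merging subsets is penalized: the number of frequency counts grows exponentially in the subset size, so a merge is only worthwhile when it buys a correspondingly large reduction in the compressed population complexity.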

The compressed population complexity, C_p, quantifies the suitability of the model in terms of the number of bits required to store the entire selected population (the set of promising solutions picked by the selection operator) with an ideal compression scheme applied. The compression scheme is based on the partition of the variables. Each subset of the variables specifies an independent “compression block” on which the corresponding partial solutions are optimally compressed. Theoretically, the optimal compression method encodes a message of probability p_i using − log₂p_i bits. Thus, taking into account all possible messages, the expected length of a compressed message is Σ_i −p_i log₂p_i bits, which is optimal. In information theory [34], the quantity − log₂p_i is called the information of that message and Σ_i −p_i log₂p_i is called the entropy of the corresponding distribution. Based on information theory, the compressed population complexity, C_p, can be derived as

$$C_p = n \sum_{i=1}^{m} \sum_{j=1}^{2^{k_i}} -p_{ij} \log_2 p_{ij},$$

where p_ij is the frequency of the jth possible partial solution to the ith variable subset observed in the selected population.

Note that in the calculation of C_p, it is assumed that the jth possible partial solution to the ith variable subset is encoded using − log₂p_ij bits. This assumption is fundamental to our technique to identify the likely ineffective marginal distributions. More precisely, the information of the partial solutions, − log₂p_ij, is a good indicator of inconsistency of statistics gathered from two separate sub-populations.
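The greedy MPM search and its two complexity terms can be sketched as follows. This is a minimal, unoptimized illustration, not the ECGA reference implementation: the function names are hypothetical, and the compressed population complexity is assumed to take the standard ECGA form of n times the sum of the marginal entropies.

```python
import math
from collections import Counter
from itertools import combinations

def model_complexity(partition, n):
    """C_m = log2(n + 1) * sum_i (2^k_i - 1)."""
    return math.log2(n + 1) * sum(2 ** len(sub) - 1 for sub in partition)

def population_complexity(partition, selected, n):
    """C_p, assumed here to be n times the sum of marginal entropies."""
    total = 0.0
    for sub in partition:
        counts = Counter(tuple(s[i] for i in sub) for s in selected)
        for c in counts.values():
            p = c / len(selected)
            total -= p * math.log2(p)
    return n * total

def combined(partition, selected, n):
    return (model_complexity(partition, n)
            + population_complexity(partition, selected, n))

def greedy_mpm_search(selected, n):
    """Start from singletons; repeatedly apply the best merge, if any."""
    partition = [(i,) for i in range(len(selected[0]))]
    best = combined(partition, selected, n)
    while len(partition) > 1:
        candidates = []
        for a, b in combinations(range(len(partition)), 2):
            merged = [sub for k, sub in enumerate(partition) if k not in (a, b)]
            merged.append(tuple(sorted(partition[a] + partition[b])))
            candidates.append((combined(merged, selected, n), merged))
        score, merged = min(candidates, key=lambda c: c[0])
        if score >= best:  # no merge reduces C_m + C_p: stop
            break
        best, partition = score, merged
    return sorted(partition)
```

Each iteration re-evaluates every pair of subsets from scratch, which keeps the sketch short; a practical implementation would cache the per-subset entropies, since a merge only changes two of them.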

4.2 Model Pruning

Our technique to identify the possibly ineffective fragments of a marginal product model is based on the notion that ECGA uses compression performance to quantify the suitability of a probabilistic model for the given set of solutions. The degree of compression is a representative measure of how well a model fits, because all good compression methods are based on capturing and utilizing the relationships among data. Thus, if the compression scheme of the MPM built on one set of solutions is incapable of compressing another set of solutions produced under the same condition, then it is very likely that the obtained MPM is at least partially incorrect. Using this property, we can systematically check the given MPM for the likely ineffective portions.

Suppose that the population of solutions, P, is split into two sub-populations S and T. The model searching is performed on S′, the set of promising solutions selected from S. Then we can use the statistics collected from T′, the set of solutions selected from T, to examine the built probabilistic model, M. Since each marginal model functions independently, they can be inspected separately. Recall that a variable subset, which specifies a marginal model, is viewed as a “compression block” that encodes each possible partial solution according to the marginal distribution. That is, the jth possible partial solution to the ith variable subset is encoded using − log₂p_ij bits, where p_ij is the frequency of the jth possible partial solution to the ith variable subset observed in S′. Assume that the given problem is of length ℓ with binary encoding, and there are m variable subsets, the ith of size k_i, i = 1 . . . m, in the built model M. For the ith marginal model, i = 1 . . . m, we can check whether or not

$$\sum_{j=1}^{2^{k_i}} q_{ij} \left( -\log_2 p_{ij} \right) > k_i,$$

where q_ij is the frequency of the jth possible partial solution to the ith variable subset collected from T′. If the inequality holds, then the compression scheme employed in the ith marginal model is not a good one for compressing the corresponding partial solutions in T′, because it encodes a k_i-bit partial solution into a bit string of expected length more than k_i bits. By the earlier reasoning, such a condition indicates that the marginal model is likely ineffective because T′ does not agree with this part of the model. Otherwise, the model should be able to compress the partial solutions in T′.
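The check on each marginal model can be sketched as below. The names `block_frequencies`, `is_likely_ineffective`, and `prune_model` are hypothetical, and the sketch assumes that every partial solution observed in T′ also occurs in S′, so that − log₂p_ij is always defined; it raises an error otherwise rather than guessing a value.

```python
import math
from collections import Counter

def block_frequencies(solutions, subset):
    """Observed frequency of each partial solution on `subset`."""
    counts = Counter(tuple(s[i] for i in subset) for s in solutions)
    return {pattern: c / len(solutions) for pattern, c in counts.items()}

def is_likely_ineffective(subset, s_selected, t_selected):
    """True if sum_j q_ij * (-log2 p_ij) > k_i for this variable subset."""
    p = block_frequencies(s_selected, subset)  # p_ij, observed in S'
    q = block_frequencies(t_selected, subset)  # q_ij, observed in T'
    expected_bits = 0.0
    for pattern, q_ij in q.items():
        if pattern not in p:
            raise ValueError("p_ij = 0 for a pattern seen in T'; smooth p first")
        expected_bits += q_ij * -math.log2(p[pattern])
    return expected_bits > len(subset)

def prune_model(partition, s_selected, t_selected):
    """Keep only the subsets whose marginal models pass the check."""
    return [sub for sub in partition
            if not is_likely_ineffective(sub, s_selected, t_selected)]
```

Patterns with q_ij = 0 contribute nothing to the sum, so iterating only over the patterns present in T′ gives the same value as summing over all 2^{k_i} patterns.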

Explained from a machine learning perspective [35], a good model should generalize well to unseen instances. Otherwise, it captures coincidental regularities among the training data. If the model building is performed on the portion where linkage is not sensible from the given set of solutions, it will “overfit” to those partial solutions (hitchhikers) that were not subjected to proper selection pressures. Consequently, the regularities captured by this part of the model tend to be inconsistent with the true problem structure.

Furthermore, the partial solutions that were not subjected to proper selection pressures appear to be random, which brings about the phenomenon of random drifting mentioned in Chapter 2. By its nature, the drifting is random, and two different sub-populations tend to drift in two different directions. Thus, we can use the statistical inconsistency between S′ and T′ to locate possible drifting portions of the solutions and identify the likely ineffective parts of the model. By removing the likely ineffective parts, we can forge a partial but more effective model.

A practical issue in evaluating the inequality is that sometimes one or several possible partial solutions are absent from the set of selected solutions, which leaves − log₂p_ij undefined because p_ij = 0. Currently, we handle this problem by assigning a very small value, smaller than 1/n, to the p_ij's that are zero and then normalizing so that the p_ij's sum to 1 (that is, Σ_j p_ij = 1).
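The zero-frequency handling described above might look like the following sketch. The text only requires the substituted value to be smaller than 1/n; the default of 0.5/n used here, as well as the function name and the count-dictionary input, are assumptions made for illustration.

```python
from itertools import product

def smoothed_frequencies(counts, k, n, epsilon=None):
    """Return strictly positive p_ij over all 2^k patterns, summing to 1.

    counts:  dict mapping observed patterns (tuples of bits) to counts
    k:       size of the variable subset
    n:       population size
    epsilon: value given to unseen patterns; must be below 1/n
    """
    if epsilon is None:
        epsilon = 0.5 / n  # hypothetical default, chosen to be below 1/n
    total = sum(counts.values())
    raw = {}
    for pattern in product((0, 1), repeat=k):
        c = counts.get(pattern, 0)
        raw[pattern] = c / total if c > 0 else epsilon
    norm = sum(raw.values())
    return {pattern: value / norm for pattern, value in raw.items()}
```

After smoothing, every pattern has a finite code length − log₂p_ij, and the unseen patterns receive the longest codes, so a sub-population that frequently contains them still drives the expected length above k_i.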
