A new pattern representation scheme using data compression

(1)

A New Pattern Representation Scheme

Using Data Compression

Presented by Chen-hsiu Huang

(2)

Media Data Processing

• How to deal with tremendous variety of media:

– Categorization resolves the latent global information structure contained in a set of unknown data

– Recognition provides the means of correctly identifying

(3)

• We believe that a new media analysis scheme should satisfy the following requirements:

1. Generality, i.e., applicability to media data of any type

2. Facility for both categorization and recognition

3. The ability to cope with indefinitely varying media data (data that is difficult to represent by a finite set of well-defined models)

4. Easy implementation and low processing cost

(4)

VQ: Vector Quantization

• VQ is applicable to a wide range of media data and is very easy to implement.

• Categorization and recognition are performed by partitioning a given feature space into several classes and assigning an unknown vector to an appropriate class

• Due to the lack of general mapping schemes, VQ has been limited to rather low-level analysis tasks for which intrinsic feature vectors are available

(5)

• Frequency domain methods using features such as Fourier, DCT, or wavelet coefficients have wide applicability but require real-valued inputs. This requirement limits the applicability of such methods to symbolic data such as text

• NN and related algorithms provide a completely different general scheme. Such systems also cope well with indefinitely varying sources, yet are only applicable to recognition and require much training even for small tasks with a corresponding high

(6)

The PRDC System

• In PRDC, input data is converted into text and compressed using a set of encoding dictionaries; it generates a compression ratio vector (CV) as a feature of the original input

• The CV is then used as a feature vector in traditional VQ. Although some of the original information is lost in the text generation, we can still exploit the attractive properties of VQ by delimiting its scope

(7)

• The realization of PRDC depends on the ability to convert various media data into text and construct a CV feature space

• Methods for evaluating the complexity or randomness of finite sequences have been studied extensively, providing the foundation for a number of data compression techniques

• However, the use of a CV as a general pattern feature, the core of PRDC, is a new concept

(8)

[Figure: media data is converted into text by media-specific encoders (Encoder1, Encoder2, Encoder3); text compressors then produce the compression feature.]

(9)

• Let A = {a_i | 0 ≤ i ≤ n − 1} be an alphabet composed of n characters. A text t is a finite sequence over A. Let l(t) be its length

• For example, if A = {a, b}, then t₁ = aaaaaa, t₂ = aabaab, t₃ = ababab, t₄ = abbabb and t₅ = bbbbbb are texts, and l(t_i) = 6 for all i

(10)

• Call the substrings of t words, and define a dictionary d_m,t as the set of words w in t with l(w) ≤ m. For example, d_3,t₂ = {a, b, aa, ab, ba, aab, aba, baa} and d_3,t₄ = {a, b, ab, ba, bb, abb, bab, bba}

• A parsing of u by d_m,t, which is denoted by p(u, d_m,t), is a successive partitioning of u into words of d_m,t

• If d_m,t contains a significant amount of information about u, the parsed word count l(p(u, d_m,t)) is small. Note that p(u, d_m,t) is not unique
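The dictionary definition above can be sketched in a few lines of Python. This is only an illustration under the slide's definition (all substrings of t of length at most m); the function name `build_dictionary` is an assumption introduced here.

```python
def build_dictionary(t, m):
    """Return d_{m,t}: every non-empty substring (word) of t with length <= m."""
    return {t[i:i + k] for k in range(1, m + 1) for i in range(len(t) - k + 1)}


# The dictionary for t2 = aabaab with m = 3, sorted by length, then alphabetically.
d = build_dictionary("aabaab", 3)
print(sorted(d, key=lambda w: (len(w), w)))
# ['a', 'b', 'aa', 'ab', 'ba', 'aab', 'aba', 'baa']
```

The printed set matches d_3,t₂ as listed on the slide.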

(11)

• In order to ensure the uniqueness of parsing, we introduce the concept of greedy parsing gp(u, d_m,t)

• This is defined as the recursive parsing of a text u by taking the longest prefix lpf(u, d_m,t) of u in d_m,t followed by the greedy parsing of the remaining part rest(u), as defined by the following function (φ denotes null text)

$$
gp(u, d_{m,t}) =
\begin{cases}
\phi & \text{if } u = \phi \\
lpf(u, d_{m,t}).gp(rest(u), d_{m,t}) & \text{otherwise}
\end{cases}
$$

(12)

• For example, gp(t₁, d_3,t₄) = a.a.a.a.a.a, l(gp(t₁, d_3,t₄)) = 6 and gp(t₃, d_3,t₄) = ab.ab.ab, l(gp(t₃, d_3,t₄)) = 3. The uniqueness of gp(u, d_m,t) is proven in Theorem 1.

• Now, we can define the compression ratio ρ(u, d_m,t) of u by d_m,t

$$
\rho(u, d_{m,t}) = \frac{l(gp(u, d_{m,t}))}{l(u)}
$$

• Using the above example, we get ρ(t₁, d_3,t₄) = 6/6 = 1.0 and ρ(t₃, d_3,t₄) = 3/6 = 0.5
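Greedy parsing and the compression ratio can be sketched as follows. The function names are assumptions introduced for illustration. One further assumption: when no dictionary word matches at the current position, the sketch emits a single character as a word, a case the slides do not address.

```python
def build_dictionary(t, m):
    """d_{m,t}: all substrings of t with length <= m."""
    return {t[i:i + k] for k in range(1, m + 1) for i in range(len(t) - k + 1)}


def greedy_parse(u, d, m):
    """gp(u, d): repeatedly strip the longest prefix of u that is a word in d."""
    words = []
    pos = 0
    while pos < len(u):
        k = min(m, len(u) - pos)
        # Shrink the candidate prefix until it is a dictionary word.
        # Assumption: a single character is accepted even if it is not in d.
        while k > 1 and u[pos:pos + k] not in d:
            k -= 1
        words.append(u[pos:pos + k])
        pos += k
    return words


def compression_ratio(u, d, m):
    """rho(u, d) = l(gp(u, d)) / l(u) for a non-empty text u."""
    return len(greedy_parse(u, d, m)) / len(u)


d4 = build_dictionary("abbabb", 3)           # d_{3,t4}
print(".".join(greedy_parse("aaaaaa", d4, 3)), compression_ratio("aaaaaa", d4, 3))
print(".".join(greedy_parse("ababab", d4, 3)), compression_ratio("ababab", d4, 3))
# a.a.a.a.a.a 1.0
# ab.ab.ab 0.5
```

Both printed lines agree with the worked example above.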

(13)

• In order to enhance the descriptive power of the feature, let us use a tuple of dictionaries D_m,T = (d_m,t₁, d_m,t₂, ..., d_m,t_n) constructed from a text set T = {t₁, t₂, ..., t_n}. Then we can define an n-dimensional CV of u.

ρ(u, D_m,T) = (ρ(u, d_m,t₁), ..., ρ(u, d_m,t_n))

• If we choose T = {t₂, t₄} and m = 3, we obtain the CVs shown in Table 1.

• The distances between these vectors represent similarities between the original texts.

(14)

CVs for Example Texts
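As an illustration, the CVs of the five example texts for T = {t₂, t₄} and m = 3 can be recomputed directly from the definitions. This is a sketch, not a copy of the original Table 1. The helper names are assumptions, and unmatched single characters are treated as one-character words.

```python
def build_dictionary(t, m):
    """d_{m,t}: all substrings of t with length <= m."""
    return {t[i:i + k] for k in range(1, m + 1) for i in range(len(t) - k + 1)}


def greedy_word_count(u, d, m):
    """l(gp(u, d)): number of words produced by longest-prefix-first parsing."""
    count, pos = 0, 0
    while pos < len(u):
        k = min(m, len(u) - pos)
        while k > 1 and u[pos:pos + k] not in d:
            k -= 1
        pos += k
        count += 1
    return count


def compression_vector(u, T, m):
    """rho(u, D_{m,T}): one compression ratio per dictionary text in T."""
    return tuple(greedy_word_count(u, build_dictionary(t, m), m) / len(u) for t in T)


texts = {"t1": "aaaaaa", "t2": "aabaab", "t3": "ababab", "t4": "abbabb", "t5": "bbbbbb"}
T = (texts["t2"], texts["t4"])
for name, u in texts.items():
    print(name, u, " ".join(f"{x:.3f}" for x in compression_vector(u, T, 3)))
```

Under these assumptions the sketch yields (0.500, 1.000) for t₁, (0.333, 0.667) for t₂, (0.500, 0.500) for t₃, (0.667, 0.333) for t₄, and (1.000, 0.500) for t₅, so each text is compressed best by its own dictionary and texts with similar structure land close together.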

(15)

Mathematical Discussion of CVs

• For CVs to be valid feature vectors of texts, the following minimum requirements must be met:

1. The CV of any text t must be uniquely determinable

2. Similarities between texts should be adequately reflected in

(16)

• For (1), we show in Theorem 1 that the use of greedy parsing allows us to map a text to a unique CV in a multidimensional unit cube spanned by D_m,T

• As for (2), we point out in Theorem 2 that the mapping from a text to its CV may be degenerate, i.e., different texts can be mapped to an identical CV

• However, in Theorem 3, we show that we can remedy this situation by extending D_m,T. We show in Theorems 4 and 5 that similar texts are mapped to similar CVs.

(17)

D_m,T = (d_m,t₁, d_m,t₂, ..., d_m,t_n),

the compression ratio vector ρ(u, D_m,T) is determined uniquely.

Moreover, ρ(u, D_m,T) ∈ [0, 1]^|T|, where |T| is the cardinality of T and [0, 1]^|T| is a |T|-dimensional unit cube

Proof. The uniqueness of ρ(u, d_m,t_k) = (1/l(u)) l(gp(u, d_m,t_k)) follows from the uniqueness of l(u) and l(gp(u, d_m,t_k)). As the former is obvious, we show the latter by showing its minimality. Suppose, to the contrary, that some parsing satisfies l(p(u, d_m,t_k)) < l(gp(u, d_m,t_k)), then for

(18)

d_m,t_k, contradicting the greediness of w_gp_i. Therefore, l(gp(u, d_m,t_k)) should be minimal. The latter part of the theorem follows from the obvious fact 1 ≤ l(gp(u, d_m,t_k)) ≤ l(u) and the definition of

(19)

u = v ⇒ ρ(u, D_m,T) = ρ(v, D_m,T), but the converse is not always true.

Proof. The first part of the theorem is obvious. The last part is shown by counterexample. Let A = {a, b, 0, 1}, u = ababbb, v = 010111, T = {ababbb010111, 010111ababbb} and

D_2,T = ({a, b, ab, ba, bb, b0, 0, 1, 01, 10, 11}, {0, 1, 01, 10, 11, 1a, a, b, ab, ba, bb})

We then have ρ(u, D_2,T) = ρ(v, D_2,T) = (0.5, 0.5). But u ≠ v.
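The counterexample can be checked mechanically. The sketch below rebuilds both dictionaries from T with m = 2 and computes the two CVs. The helper names are assumptions introduced for illustration.

```python
def build_dictionary(t, m):
    """d_{m,t}: all substrings of t with length <= m."""
    return {t[i:i + k] for k in range(1, m + 1) for i in range(len(t) - k + 1)}


def greedy_word_count(u, d, m):
    """l(gp(u, d)) using longest-prefix-first parsing."""
    count, pos = 0, 0
    while pos < len(u):
        k = min(m, len(u) - pos)
        while k > 1 and u[pos:pos + k] not in d:
            k -= 1
        pos += k
        count += 1
    return count


def compression_vector(u, T, m):
    return tuple(greedy_word_count(u, build_dictionary(t, m), m) / len(u) for t in T)


T = ("ababbb010111", "010111ababbb")
u, v = "ababbb", "010111"
print(compression_vector(u, T, 2), compression_vector(v, T, 2))
# (0.5, 0.5) (0.5, 0.5)
```

Both texts parse into three two-character words under either dictionary, so two different texts share one CV, which is the degeneracy the theorem describes.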

(20)

(ρ(u, D_m,T) = ρ(v, D_m,T)) ∧ u ≠ v then

∃ m̂.[ρ(u, D_m̂,T∪{u,v}) ≠ ρ(v, D_m̂,T∪{u,v})]

Proof. Assuming the contrary, let us attempt to refute the conclusion, getting

∀ m̂.[ρ(u, D_m̂,T∪{u,v}) = ρ(v, D_m̂,T∪{u,v})]

Separating the compression operations of T and {u, v}, we get

∀ m̂.[ρ(u, D_m̂,T) = ρ(v, D_m̂,T) ∧ ρ(u, D_m̂,{u,v}) = ρ(v, D_m̂,{u,v})]

Using the second term, we get

(21)

m̂ = max(l(u), l(v)) = l(u)

we get

ρ(u, d_m̂,u) = 1/l(u) < 1/l(v) ≤ ρ(v, d_m̂,u)

This contradicts the above formula.

In the case of l(u) = l(v), if we choose m̂ = l(u) = l(v) and use the above formula, we obtain

(22)

Finally, we show that similar texts are mapped to similar CVs.

We ﬁrst show in Theorem 4 that the CV of a concatenated text uv can be approximated by a weighted sum of CVs of u and v.

Then, using this result, we show that a minor variant of u is mapped to a minor variant of the CV of u.

(23)

$$
\left\| \rho_{uv} - \left( \frac{l(u)}{l(uv)}\rho_u + \frac{l(v)}{l(uv)}\rho_v \right) \right\| \le \frac{r(T)}{l(uv)}
$$

where ρ_uv abbreviates ρ(uv, D_m,T), etc., and r(T) is the radius of a unit sphere in |T|-dimensional space, such as r(T) = √|T| (Euclidean distance) or r(T) = |T| (city-block distance).

Moreover, if l(uvw) is large and l(v) ≪ l(uvw), that is, if v is much shorter than uvw, then we get

(24)

l(gp(uv, d_m,t)) ≥ l(gp(u, d_m,t)) + l(gp(v, d_m,t)) + 1

> l(gp(u, d_m,t)) + l(gp(v, d_m,t))

This means it is possible to obtain a trivial nongreedy parsing p(uv, d_m,t) = gp(u, d_m,t).gp(v, d_m,t) satisfying l(p(uv, d_m,t)) < l(gp(uv, d_m,t)). This contradicts Theorem 1. Second, we prove

l(gp(u, d_m,t)) + l(gp(v, d_m,t)) − 1 ≤ l(gp(uv, d_m,t))

Assuming that gp(uv, d_m,t) = w_gp₁.w_gp₂...w_gp_l, there exists a word w_gp_k such that w_gp_k = w⁻_gp_k w⁺_gp_k and gp(u, d_m,t) = w_gp₁.w_gp₂...w⁻_gp_k. We are given a nongreedy parsing

(25)

As l(gp(v, d_m,t)) ≤ l(p(v, d_m,t)) by Theorem 1, we get

l(gp(v, d_m,t)) ≤ l(p(v, d_m,t)) ≤ l(gp(uv, d_m,t)) − l(gp(u, d_m,t)) + 1

This implies

l(gp(u, d_m,t)) + l(gp(v, d_m,t)) − 1 ≤ l(gp(uv, d_m,t))

Dividing these two inequalities by l(uv) and using 1/l(uv) = (l(u)/l(uv)) · (1/l(u)) and 1/l(uv) = (l(v)/l(uv)) · (1/l(v)), we obtain

(26)

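To prove the latter part, let u′ = uv and use (1) twice. We then get

$$
\begin{aligned}
\rho_{uvw} = \rho_{u'w} &\approx \frac{l(u')}{l(u'w)}\rho_{u'} + \frac{l(w)}{l(u'w)}\rho_w \\
&\approx \frac{l(u')}{l(u'w)}\left( \frac{l(u)}{l(uv)}\rho_u + \frac{l(v)}{l(uv)}\rho_v \right) + \frac{l(w)}{l(u'w)}\rho_w \\
&= \frac{l(u)}{l(uvw)}\rho_u + \frac{l(v)}{l(uvw)}\rho_v + \frac{l(w)}{l(uvw)}\rho_w
\end{aligned}
$$

Therefore, when l(v) ≪ l(uvw), we get

$$
\rho_{uvw} \approx \frac{l(u)}{l(uvw)}\rho_u + 0 + \frac{l(w)}{l(uvw)}\rho_w
$$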
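The concatenation bound of Theorem 4 can be spot-checked numerically. The sketch below enumerates all short texts over {a, b} and measures, per dictionary, how far l(uv)·ρ_uv deviates from l(u)·ρ_u + l(v)·ρ_v; by the two inequalities in the proof this deviation should never exceed 1. The helper names, the choice of dictionary texts, and the length limit are assumptions made for illustration.

```python
from itertools import product


def build_dictionary(t, m):
    """d_{m,t}: all substrings of t with length <= m."""
    return {t[i:i + k] for k in range(1, m + 1) for i in range(len(t) - k + 1)}


def greedy_word_count(u, d, m):
    """l(gp(u, d)) using longest-prefix-first parsing."""
    count, pos = 0, 0
    while pos < len(u):
        k = min(m, len(u) - pos)
        while k > 1 and u[pos:pos + k] not in d:
            k -= 1
        pos += k
        count += 1
    return count


def worst_deviation(T, m, alphabet="ab", max_len=4):
    """Largest |l(gp(uv)) - l(gp(u)) - l(gp(v))| over all short u, v and all dictionaries."""
    dicts = [build_dictionary(t, m) for t in T]
    texts = ["".join(p) for n in range(1, max_len + 1) for p in product(alphabet, repeat=n)]
    worst = 0
    for u in texts:
        for v in texts:
            for d in dicts:
                joint = greedy_word_count(u + v, d, m)
                separate = greedy_word_count(u, d, m) + greedy_word_count(v, d, m)
                worst = max(worst, abs(joint - separate))
    return worst


print(worst_deviation(("aabaab", "abbabb"), 3))
```

Dividing the deviation by l(uv) gives the per-component error of the weighted-sum approximation, so a value of at most 1 here corresponds to an error of at most 1/l(uv) in each CV component.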

(27)

Encoding Media Data into Text

Sequential Pattern. Given a nontext sequence s = s₁s₂...s_n, first segment s to obtain

SEG(s) = v₁v₂...v_l

Replace each segment by a letter to give a text

t = VQ(SEG(s)) = VQ(v₁)VQ(v₂)...VQ(v_l)
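A minimal sketch of this encoding is given below. The slides do not specify SEG or VQ, so both are stand-ins: SEG is assumed to cut fixed-length frames, and VQ is assumed to be a uniform scalar quantizer on the frame mean rather than a trained codebook. All names and parameters are hypothetical.

```python
import string


def segment(s, frame_len):
    """SEG(s): split the sequence into consecutive fixed-length frames (assumed scheme)."""
    return [s[i:i + frame_len] for i in range(0, len(s) - frame_len + 1, frame_len)]


def quantize(frame, lo, hi, n_codes):
    """VQ(v): map a frame to one letter via the uniform bin of its mean value."""
    mean = sum(frame) / len(frame)
    idx = int((mean - lo) / (hi - lo) * n_codes)
    idx = min(max(idx, 0), n_codes - 1)      # clamp values outside [lo, hi)
    return string.ascii_lowercase[idx]       # supports up to 26 codes


def encode_sequence(s, frame_len=2, lo=0.0, hi=1.0, n_codes=4):
    """t = VQ(SEG(s)): one letter per frame."""
    return "".join(quantize(v, lo, hi, n_codes) for v in segment(s, frame_len))


print(encode_sequence([0.0, 0.1, 0.9, 1.0, 0.5, 0.5]))
# adc
```

The resulting text can then be compressed with the dictionaries exactly like any other text.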

(28)

Spatial Pattern. Let P = {p_i,j | (i, j) ∈ I_r × I_c} be a color image composed of pixels in I_r rows and I_c columns, where p_i,j denotes the RGB-vector of a pixel (i, j). First, compile P into a nondirected weighted graph G(P) composed of nodes n_i,j, edges e_i,j,k,l, and edge weights w_i,j,k,l.

Here, we define the edge weight as the color difference between two terminal nodes using an appropriate distance function d(x, y), i.e.,

(29)

of G(P). Starting from a node, at the north-west corner, for example, traverse MST(G(P)) in a light-weight-edge-first manner, outputting a sequence

TRAV(MST(G(P))) = (p_i₁,j₁, dir_1,0)(p_i₂,j₂, dir_2,1)...(p_i_l,j_l, dir_l,l−1)

Here, p_i_k,j_k and dir_k,k−1 denote the color of the current node and the traverse direction from the previous node n_i_k−1,j_k−1 to the current node n_i_k,j_k. For example

(30)

Finally, we encode TRAV(MST(G(P))) into a text

t = VQ(TRAV(MST(G(P))))

= VQ((p_i₁,j₁, dir_1,0))...VQ((p_i_l,j_l, dir_l,l−1))

Note that this scheme is an extension of Freeman's chain code [28] in that both color contour shape (part of spatial) information and color (spectral) information on P are encoded simultaneously.
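The spatial encoding can be sketched as follows. To stay short, the sketch makes several simplifying assumptions that are not in the slides: pixels are gray values rather than RGB vectors, nodes are 4-connected, the edge weight is the absolute gray difference, the MST is built with Prim's algorithm, and each (gray level, direction) pair is mapped to one letter. All names are hypothetical.

```python
import heapq

DIRS = {(-1, 0): 0, (0, 1): 1, (1, 0): 2, (0, -1): 3}   # N, E, S, W
START = 4                                               # pseudo-direction for the root


def mst_children(img):
    """Prim's algorithm from the north-west corner; returns parent -> [(weight, child)]."""
    rows, cols = len(img), len(img[0])
    children, visited, heap = {}, {(0, 0)}, []

    def push_edges(node):
        i, j = node
        for di, dj in DIRS:
            k, l = i + di, j + dj
            if 0 <= k < rows and 0 <= l < cols and (k, l) not in visited:
                heapq.heappush(heap, (abs(img[i][j] - img[k][l]), node, (k, l)))

    push_edges((0, 0))
    while heap:
        w, parent, child = heapq.heappop(heap)
        if child in visited:
            continue
        visited.add(child)
        children.setdefault(parent, []).append((w, child))
        push_edges(child)
    return children


def encode_image(img, n_levels=4, max_value=255):
    """Traverse the MST lightest-edge-first and emit one letter per visited pixel."""
    children = mst_children(img)
    symbols, stack = [], [((0, 0), START)]
    while stack:
        (i, j), direction = stack.pop()
        level = min(img[i][j] * n_levels // (max_value + 1), n_levels - 1)
        symbols.append(chr(ord("A") + level * 5 + direction))
        # Push heavier edges first so that the lightest edge is popped next.
        for w, (k, l) in sorted(children.get((i, j), []), reverse=True):
            stack.append(((k, l), DIRS[(k - i, l - j)]))
    return "".join(symbols)


print(encode_image([[0, 0], [255, 255]]))
```

An iterative traversal is used instead of recursion so that images of a few thousand pixels do not hit Python's recursion limit.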

(31)

(32)

• Encoding into Text. Input media data is encoded into text using the method described above.

• Text Compression. The buffer-type dictionary approach is adopted in some of the LZ-type text compression algorithms [26], [27].

• Selection of a Dictionary Set. First, prepare a small text set T_s to get D_m,T_s. Perform cluster analysis on the output vectors {ρ(t, D_m,T_s) | t ∈ T_l}. Set the value |T| equal to the number of

(33)

D_m,T, we choose a set of training texts T_c and prepare a set of teaching data {(v, atr(v)) | v ∈ T_c}, where atr(v) is the manually prepared attribute of v. We use D_m,T to compress texts v ∈ T_c to obtain a case database CDB = {(ρ(v, D_m,T), atr(v)) | v ∈ T_c}. In categorizing a set of texts T_d, we calculate the respective CVs {ρ(v, D_m,T) | v ∈ T_d}, on which we perform a cluster analysis [10].

• Recognition. In recognition tasks, the element (ρ(v∗, D_m,T), atr(v∗)) ∈ CDB nearest to the incoming ρ(u, D_m,T) is selected and the corresponding atr(v∗) is output as the result.
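The recognition step amounts to a nearest-neighbor lookup in the case database. A minimal sketch, assuming Euclidean distance and a CDB stored as a list of (CV, attribute) pairs; the function name, the example CVs, and the labels are hypothetical.

```python
import math


def recognize(cv, cdb):
    """Return atr(v*) of the CDB entry whose CV is nearest to cv (Euclidean distance)."""
    nearest_cv, nearest_atr = min(cdb, key=lambda entry: math.dist(entry[0], cv))
    return nearest_atr


# Hypothetical case database: CVs paired with manually assigned attributes.
cdb = [
    ((0.5, 1.0), "class A"),
    ((1.0, 0.5), "class B"),
]
print(recognize((0.667, 0.333), cdb))
# class B
```

Another metric, such as city-block distance, could be substituted in the `key` function without changing the rest of the lookup.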

(34)

Feature Representability of CV

Computer Programs. The header and source parts of C programs are gathered into two files, H.txt and C.txt, both of which contain approximately 2,300 characters. These files are concatenated to form HH.txt, HC.txt, CH.txt, and CC.txt.

The resolution is quite sharp and the six CVs can be clearly categorized into three groups: {H.txt, HH.txt}, {HC.txt, CH.txt}, and {C.txt, CC.txt}. The dictionaries {H.txt, HH.txt} compress {H.txt, HH.txt} and {HC.txt, CH.txt} well, but, as expected, this is not the case for {C.txt, CC.txt}.

(35)

(36)

Human Voice. Two 30-second self-introduction speeches are recorded from each of five students. Each file is divided into frames, and each frame is encoded into one of 26 codes. Several frame lengths were tested, and a frame length of 25 ms was found to provide the clearest features.

The resolution is not as sharp, except for {K1.wav, K2.wav}. However, it is possible to categorize the data into three groups:

{K1.wav, K2.wav}, {S1.wav, S2.wav, P1.wav, P2.wav}, and {T1.wav,

(37)

(38)

Gray-scale Image. Eight areas of 50 × 50 pixels are selected from Fig. 2d to generate files in bmp format. Each pixel of these subimages is encoded into one of 64 codes, (8 MST directions) × (8 grayscales), using the proposed spatial pattern coding method.

The resolution is sharp, with four distinct groups: {3C.bmp, 4C.bmp}, {4A.bmp, 2B.bmp, 5B.bmp}, {4B.bmp, 3B.bmp}, and {1B.bmp}. This is in good agreement with our visual impression of the original image. Based on this categorization, 3B.bmp as an unknown input will be recognizable as being similar to 4B.bmp, in accordance with our intuition.

(39)

(40)

Applications

(41)

(42)

(43)

(44)

Conclusion

• We have proposed a new pattern representation scheme called PRDC, by which input data is converted into a text and then compressed using a set of dictionaries. PRDC can realize attractive properties for media analysis: generality, facility for both categorization (class formation) and recognition (classification), ability to cope with indefinitely varying media data, and easy implementability.

• We have presented a mathematical proof of the realizability of a feature space of CVs and demonstrated the usefulness of PRDC

(45)

Future Work

• Future investigations include the application of PRDC to more specific and sophisticated media analysis tasks and a performance comparison with other methods.

• We anticipate that combinations of PRDC with high-level methods will be effective. In addition, we intend to examine a variant of PRDC that uses media-specific compressors rather than universal text compressors. We anticipate that this variant