4.5 Automata Learning Algorithm

We mentioned in Section 2.1.1 that Drewes proposed two different tree automata learning algorithms. Because L^∗_{fta} is the basis of our learning algorithm, we examine it in more detail.

4.5.1 Learning Algorithm of Drewes

For a set of trees T, let Σ(T) denote the set of all trees of the form f(t_1, ..., t_k), where f ∈ Σ^k and t_1, ..., t_k ∈ T. Recall that T(Σ) denotes the set of all trees over the ranked alphabet Σ. Let □ ∉ Σ be a special symbol of rank 0, and let C(Σ) be the set of all trees in T(Σ ∪ {□}) with exactly one occurrence of □; we call these trees contexts over Σ. The concatenation c · t, with c ∈ C(Σ) and t ∈ T(Σ) ∪ C(Σ), is the tree obtained from c by replacing □ with t.
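The definitions above can be made concrete with a small sketch. It assumes a representation that is not part of the original text: a tree is a nested Python tuple `(symbol, child_1, ..., child_k)`, the ranked alphabet is a dictionary from symbols to ranks, and the names `HOLE`, `concat`, and `sigma_of` are hypothetical helpers chosen for illustration.

```python
from itertools import product

# Assumption: a tree is a nested tuple (symbol, child_1, ..., child_k).
HOLE = ("□",)  # the special rank-0 symbol, assumed not to occur in Σ


def concat(c, t):
    """Return c · t: the tree obtained from context c by replacing HOLE with t."""
    if c == HOLE:
        return t
    symbol, *children = c
    return (symbol, *(concat(child, t) for child in children))


def sigma_of(ranked_alphabet, trees):
    """Return Σ(T): all trees f(t_1, ..., t_k) with f of rank k and t_i in T."""
    result = set()
    for symbol, rank in ranked_alphabet.items():
        for children in product(trees, repeat=rank):
            result.add((symbol, *children))
    return result


a, b = ("a",), ("b",)
c = ("f", HOLE, b)       # the context f(□, b)
print(concat(c, a))      # ('f', ('a',), ('b',))
```

Because `t` may itself be a context, `concat` also covers the case t ∈ C(Σ): the result then still contains exactly one occurrence of the hole.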

Similar to Angluin’s L^∗ algorithm, Drewes’s algorithm uses an observation table Ω to construct a tree automaton. The observation table Ω is divided into two parts: the upper table Ω_U and the lower table Ω_L. The rows of Ω_U are indexed by the trees in S, where S ⊆ T(Σ). The rows of Ω_L are indexed by the trees in T, where T ⊆ Σ(S). The columns of Ω are indexed by contexts from a finite set C ⊆ C(Σ). We use Mem^L : T(Σ) → B for the membership relation of a tree t in the tree language L: if t ∈ L, then Mem^L(t) = True; otherwise, Mem^L(t) = False. The cell in row t and column c of the observation table is filled with Mem^L(c · t), the membership of the tree c · t in L. We write ⟨t⟩ for the row of t in Ω, and extend this notation to a finite set of trees T by ⟨T⟩ = {⟨t⟩ | t ∈ T}.

The idea behind Angluin’s L^∗ algorithm is to construct an automaton by exploiting the Myhill-Nerode congruence of the target language. The Myhill-Nerode congruence ≡^L on T(Σ) is defined as follows: t ≡^L t′ iff Mem^L(c · t) = Mem^L(c · t′) for all c ∈ C(Σ). We say that trees t and t′ are equivalent with respect to C iff Mem^L(c · t) = Mem^L(c · t′) for all c ∈ C.
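As an illustration, a row of the observation table and equivalence with respect to C can be sketched as follows. The tuple representation of trees, the toy language (trees with an odd number of a-leaves), and the names `mem`, `row`, and `equivalent` are assumptions made for this example only.

```python
HOLE = ("□",)  # assumed encoding of the hole symbol


def concat(c, t):
    """Return c · t by replacing the hole in context c with t."""
    if c == HOLE:
        return t
    symbol, *children = c
    return (symbol, *(concat(child, t) for child in children))


def count_a(t):
    """Count the leaves labelled 'a' in tree t."""
    symbol, *children = t
    return (symbol == "a") + sum(count_a(child) for child in children)


def mem(t):
    """Hypothetical membership oracle: t is in L iff it has an odd number of a-leaves."""
    return count_a(t) % 2 == 1


def row(t, contexts):
    """The row <t>: one membership value Mem(c · t) per context c."""
    return tuple(mem(concat(c, t)) for c in contexts)


def equivalent(t1, t2, contexts):
    """t1 and t2 are equivalent with respect to C iff their rows coincide."""
    return row(t1, contexts) == row(t2, contexts)


a, b = ("a",), ("b",)
C = [HOLE, ("f", HOLE, a)]
print(row(a, C))                        # (True, False)
print(equivalent(a, ("f", a, b), C))    # True
print(equivalent(a, b, C))              # False
```

Equal rows only establish equivalence with respect to the finite set C; a context outside C may still separate two trees, which is what the counter-examples are used to discover.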

To construct the finite tree automaton A^Ω from Ω, two properties have to hold:

1. Ω is closed, that is, ⟨t⟩ ∈ ⟨S⟩ for every t ∈ T.

2. Ω is consistent. Let Σ_□(S) = C(Σ) ∩ Σ(S ∪ {□}). The observation table Ω is consistent if ⟨c · s⟩ = ⟨c · s′⟩ for all c ∈ Σ_□(S) and all s, s′ ∈ S with ⟨s⟩ = ⟨s′⟩. If ⟨c · s⟩ ≠ ⟨c · s′⟩, then s and s′ are not equivalent with respect to Σ_□(S), and there exists a separating context that witnesses this inequivalence.
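Both properties can be checked mechanically. The following is a minimal sketch under assumptions of our own: trees are nested tuples, the alphabet is a dictionary from symbols to ranks, the membership oracle `mem` is passed in as a function, and all function names are hypothetical.

```python
from itertools import product

HOLE = ("□",)  # assumed encoding of the hole symbol


def concat(c, t):
    if c == HOLE:
        return t
    symbol, *children = c
    return (symbol, *(concat(child, t) for child in children))


def row(t, contexts, mem):
    return tuple(mem(concat(c, t)) for c in contexts)


def is_closed(S, T, contexts, mem):
    """Closed: every row of the lower table also occurs in the upper table."""
    upper_rows = {row(s, contexts, mem) for s in S}
    return all(row(t, contexts, mem) in upper_rows for t in T)


def one_hole_contexts(alphabet, S):
    """Σ_□(S): trees f(s_1, ..., □, ..., s_k) with one hole and all other children in S."""
    result = set()
    for symbol, rank in alphabet.items():
        for position in range(rank):
            for others in product(S, repeat=rank - 1):
                children = others[:position] + (HOLE,) + others[position:]
                result.add((symbol, *children))
    return result


def is_consistent(alphabet, S, contexts, mem):
    """Consistent: equal rows stay equal after applying any context in Σ_□(S)."""
    for s1, s2 in product(S, repeat=2):
        if row(s1, contexts, mem) != row(s2, contexts, mem):
            continue
        for c in one_hole_contexts(alphabet, S):
            if row(concat(c, s1), contexts, mem) != row(concat(c, s2), contexts, mem):
                return False
    return True


a, b = ("a",), ("b",)
alphabet = {"a": 0, "b": 0, "f": 2}
only_faa = lambda t: t == ("f", a, a)   # hypothetical language {f(a, a)}
print(is_closed([b], [a], [HOLE], lambda t: t == a))      # False
print(is_consistent(alphabet, [a, b], [HOLE], only_faa))  # False
```

In the second call, a and b have the same row under C = {□}, but the context f(□, a) separates them, so the table is not consistent.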

After ensuring that Ω is both closed and consistent, we can construct A^Ω = (Σ, Q, ∆, F) as follows:

• The set of states Q is ⟨S⟩.

• ⟨s⟩ ∈ F if s ∈ L.

• For every tree t = f(s_1, ..., s_k) ∈ Σ(S), the corresponding transition rule is f(⟨s_1⟩, ..., ⟨s_k⟩) → ⟨t⟩.
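The construction above can be sketched as follows, assuming a closed and consistent table. States are represented by rows (tuples of Booleans); the tuple encoding of trees, the toy language of trees with an odd number of a-leaves, and the names `build_automaton`, `run`, and `accepts` are assumptions for illustration.

```python
from itertools import product

HOLE = ("□",)  # assumed encoding of the hole symbol


def concat(c, t):
    if c == HOLE:
        return t
    symbol, *children = c
    return (symbol, *(concat(child, t) for child in children))


def build_automaton(alphabet, S, contexts, mem):
    """Build (Q, ∆, F) from a table that is assumed to be closed and consistent."""
    def row(t):
        return tuple(mem(concat(c, t)) for c in contexts)

    states = {row(s) for s in S}
    final = {row(s) for s in S if mem(s)}
    delta = {}
    for symbol, rank in alphabet.items():
        for children in product(S, repeat=rank):
            # Rule f(<s_1>, ..., <s_k>) -> <f(s_1, ..., s_k)>
            delta[(symbol, tuple(row(ch) for ch in children))] = row((symbol, *children))
    return states, delta, final


def run(t, delta):
    """Evaluate t bottom-up; return None when no transition rule applies."""
    symbol, *children = t
    child_states = tuple(run(child, delta) for child in children)
    if None in child_states:
        return None
    return delta.get((symbol, child_states))


def accepts(t, delta, final):
    return run(t, delta) in final


def count_a(t):
    symbol, *children = t
    return (symbol == "a") + sum(count_a(child) for child in children)


a, b = ("a",), ("b",)
mem = lambda t: count_a(t) % 2 == 1     # hypothetical target language
states, delta, final = build_automaton({"a": 0, "b": 0, "f": 2}, [a, b], [HOLE], mem)
print(accepts(("f", ("f", a, b), b), delta, final))   # True
print(accepts(("f", a, a), delta, final))             # False
```

The automaton is partial: a tree that uses a symbol or state combination without a rule, such as g(a) here, is simply rejected.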

Now, let us describe the L^∗_{fta} learning algorithm. As an extension of Angluin’s L^∗, the learning algorithm poses membership queries and equivalence queries to a teacher. Initially, the observation table has S = ∅ and C = {□}. The algorithm then constructs an automaton from the observation table and performs an equivalence query. If a counter-example is returned, the observation table is updated. This process repeats until no counter-example is returned. The pseudocode of the algorithm is presented in Algorithm 4.1.

Algorithm 4.1 L^∗_{fta}
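The control flow described in the prose (construct a hypothesis, ask an equivalence query, update on a counter-example, repeat) can be sketched as a generic loop. This is not a reproduction of Algorithm 4.1: the function `learn` and its callback parameters are hypothetical, and the toy stand-ins below exist only so that the loop can be executed.

```python
def learn(table, synthesize, equivalence_query, update):
    """Generic query-learning loop: refine the table until no counter-example remains."""
    while True:
        hypothesis = synthesize(table)
        counter_example = equivalence_query(hypothesis)
        if counter_example is None:
            return hypothesis
        table = update(table, counter_example)


# Toy stand-ins (NOT the L*_fta procedures): the "table" is the set of trees
# seen so far and the "hypothesis" is that same finite set.
a, b = ("a",), ("b",)
target = {a, ("f", a, b)}   # hypothetical finite target language


def synthesize(table):
    return frozenset(table)


def equivalence_query(hypothesis):
    missing = sorted(target - hypothesis)   # deterministic choice of counter-example
    return missing[0] if missing else None


def update(table, counter_example):
    return table | {counter_example}


result = learn(frozenset(), synthesize, equivalence_query, update)
print(result == target)   # True
```

In the actual algorithm, `synthesize` would correspond to building A^Ω from the table and `update` to the table update described next.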

When updating the observation table (Algorithm 4.2), the returned counter-example t is decomposed from the bottom up to obtain a subtree t′ that is not in S, where t = c · t′ for a context c. If t′ is also not in T, t′ is added to Ω_L and Ω is closed. Otherwise, we find the equivalent tree t_e in Ω_U and replace t′ with t_e to get a new tree t_new = c · t_e. If Mem^L(t) = Mem^L(t_new), t_new is decomposed again by the same process. Otherwise, Mem^L(t) ≠ Mem^L(t_new), and c is a separating context: c is added to the observation table, and the table is closed again. The function close in Algorithm 4.2 checks whether ⟨t⟩ ∈ ⟨S⟩ for every t ∈ T. If there is a tree t ∈ T with ⟨t⟩ ∉ ⟨S⟩, it moves t from T to S.
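The closing step can be sketched as follows. The sketch implements only what the text states (moving trees whose row is missing from the upper table from T to S); whether T must also be extended afterwards is not covered by the text and is left out. The tuple encoding, the toy oracle, and the function names are assumptions.

```python
HOLE = ("□",)  # assumed encoding of the hole symbol


def concat(c, t):
    if c == HOLE:
        return t
    symbol, *children = c
    return (symbol, *(concat(child, t) for child in children))


def close(S, T, contexts, mem):
    """Move trees from T to S until every row of T also occurs among the rows of S."""
    def row(t):
        return tuple(mem(concat(c, t)) for c in contexts)

    S, T = list(S), list(T)
    changed = True
    while changed:
        changed = False
        upper_rows = {row(s) for s in S}
        for t in T:
            if row(t) not in upper_rows:
                T.remove(t)      # safe: we leave the for-loop immediately
                S.append(t)
                changed = True
                break
    return S, T


def count_a(t):
    symbol, *children = t
    return (symbol == "a") + sum(count_a(child) for child in children)


a, b = ("a",), ("b",)
mem = lambda t: count_a(t) % 2 == 1   # hypothetical oracle: odd number of a-leaves
S, T = close([b], [a, ("f", b, b)], [HOLE], mem)
print(S)   # [('b',), ('a',)]
print(T)   # [('f', ('b',), ('b',))]
```

Here a has a row that b does not cover, so it is promoted to S, while f(b, b) has the same row as b and stays in T.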

Algorithm 4.2 update

if membershipQuery(c · s) = membershipQuery(t) then

    t := c · s;

This algorithm has several interesting properties. For every tree t ∈ T, there is exactly one tree s ∈ S such that ⟨s⟩ = ⟨t⟩. In other words, no redundant information is recorded, so there is no need to check the table for consistency. Moreover, the number of contexts is guaranteed not to exceed the number of states in Ω_U. The L^∗_{fta} algorithm outputs a finite tree automaton A = (Σ, Q, ∆, F) in time O(r · |Q| · |∆| · (|Q| + m)), where m is the maximum size of the counter-examples returned by the teacher and r is the maximum rank of the symbols in Σ. The algorithm requires |Q| + |∆| + 1 equivalence queries and m + |Q| · (|∆| + 1) membership queries. As mentioned in [14], the major disadvantage of L^∗_{fta} is the number of equivalence queries.
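To get a feel for these expressions, the following sketch evaluates them exactly as stated above for one set of values. The function name and the sample numbers are hypothetical; they are not measurements from this work.

```python
def query_bounds(num_states, num_transitions, max_counterexample_size, max_rank):
    """Evaluate the expressions quoted in the text for given |Q|, |∆|, m, and r."""
    q, d, m, r = num_states, num_transitions, max_counterexample_size, max_rank
    return {
        "equivalence_queries": q + d + 1,
        "membership_queries": m + q * (d + 1),
        "time_order": r * q * d * (q + m),   # argument of the O(...) expression
    }


# Hypothetical automaton: |Q| = 5, |∆| = 12, m = 20, r = 2
bounds = query_bounds(5, 12, 20, 2)
print(bounds["equivalence_queries"])   # 18
print(bounds["membership_queries"])    # 85
print(bounds["time_order"])            # 3000
```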

4.5.2 Tree Automata Learning Algorithm

Now, we show how to learn a 3DFT by adapting Drewes’s L^∗_{fta} algorithm. We are given a set of positive examples (malware) and a set of negative examples (benign programs). To use Drewes’s learning algorithm, we need a teacher that answers membership queries and equivalence queries; we therefore simulate the teacher with the given positive and negative examples. The positive and negative examples are trees rather than behavior graphs.

For membership queries, we check whether a given tree t belongs to the positive examples or the negative examples, that is, whether t is a member of the respective set. If t is in the positive examples only, the teacher returns true. If t is in the negative examples only, the teacher returns false. If t is in both the positive and the negative examples, or in neither of them, the teacher returns unknown.

The last case arises because the positive examples and the negative examples may have common members. This happens when a malware sample detects that it is being analyzed and behaves like a benign program; the generated behavior graph is then identical to that of a benign program.
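The simulated membership query can be sketched as a three-valued function. The `Answer` enumeration, the function name, and the sample trees are assumptions for illustration; trees are again encoded as nested tuples.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def membership_query(t, positives, negatives):
    """Answer a membership query from the sample sets alone."""
    in_positives = t in positives
    in_negatives = t in negatives
    if in_positives and not in_negatives:
        return Answer.TRUE
    if in_negatives and not in_positives:
        return Answer.FALSE
    return Answer.UNKNOWN   # in both sets, or in neither


a, b = ("a",), ("b",)
shared = ("f", a, b)                 # hypothetical sample present in both sets
positives = {a, shared}
negatives = {b, shared}
print(membership_query(a, positives, negatives))           # Answer.TRUE
print(membership_query(shared, positives, negatives))      # Answer.UNKNOWN
print(membership_query(("f", b, b), positives, negatives)) # Answer.UNKNOWN
```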

For equivalence queries, we check whether the samples in the positive examples are accepted and the samples in the negative examples are rejected. A tree t that is in both the positive and the negative examples is identified as unknown. If a sample violates this rule, it is returned as a counter-example. We proceed from the positive samples to the negative samples.
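The simulated equivalence query can be sketched as follows. The hypothesis is passed in as a predicate `accepts`, and samples that occur in both sets are skipped, which is one possible reading of "identified as unknown" and therefore an assumption of this sketch. All names and sample trees are hypothetical.

```python
def equivalence_query(accepts, positives, negatives):
    """Return the first sample that contradicts the hypothesis, or None if none does."""
    for t in positives:                 # positive samples are checked first
        if t in negatives:
            continue                    # assumption: unknown samples are not checked
        if not accepts(t):
            return t
    for t in negatives:
        if t in positives:
            continue
        if accepts(t):
            return t
    return None


a, b = ("a",), ("b",)
positives = [a, ("f", a, b)]
negatives = [b]

too_small = lambda t: t == a                  # hypothesis accepting only the tree a
print(equivalence_query(too_small, positives, negatives))   # ('f', ('a',), ('b',))

fits_samples = lambda t: t != b               # hypothesis rejecting only the tree b
print(equivalence_query(fits_samples, positives, negatives))  # None
```

Using lists rather than sets keeps the order in which samples are examined fixed, so the same counter-example is returned on every run.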
