HAMPI - 藉由正規限制式之擬真運算處理符號檔案輸入

HAMPI [31] is a SMT solver based on top of STP and it is for string constraints over fixed-size string variables. HAMPI do some implements to make constraints expressing membership in regular languages and fixed-size context-free languages. Given set of constraints, HAMPI outputs a string if constraints are satisfiable. The implement steps of HAMPI are following :

1. Normalize constraints to HAMPI’s core string constraints

2. Translate core string constraints to a quantifier-free logic of bit-vectors 3. Call STP to judge the bit-vector constraints are satisfiable or

unsatis-fiable.

4. If constraints are satisfied, compute and output the string solution.

HAMPI does evaluations of symbolic testing using KLEE. First, HAMPI conpile the input grammar file into STP bit-vector constraints. Then it con-verts STP bit-vector constraints into C language code that expresses by API

on target program. The line coverage result of experiments is better than original KLEE.

Figure 3-1: The Comparison of Some Relative Works

4 Methods

In this thesis, we implement the part of converting regular expression to constraints and resolving these constraints with original constraints. In this section, we will describe our method to handle the standard input constraints with user-defined regular expression information. We introduce that how does KLEE test with symbolic files first, and explain the reasons that why do we choose the method. Finally, we display the overview of our method.

The details about implementation will be described at next section.

4.1 The architecture of KLEE with symbol files

As figure 4-1 show, the source code of C language is compiled in LLVM bitcode by llvm-gcc (or clang) at first. At the next step, KLEE reads the bitcode and initializes the testing environment. To implement symbolic files testing, KLEE creates a symbolic buffer and construct the relation between standard input and symbolic buffer, then KLEE select a state from the set of states to collect constraints information instrution by instrution. If KLEE meets the branch instruction (the definition is: according some conditions like value of a variable, you has two or more basic blocks to choose. For example, if, case, while and for) or alloc instruction, it will use solver to check the possible paths, then let forked state to wait until it is selected by searcher. If there are no execution states waiting to be selected, KLEE resolves the execution constraints per path to generate the test case which can explore the corresponding execution path.

Figure 4-1: Klee Architecture

4.2 The architecture of KLEE with regular expression symbolic files support

4.2.1 The intuitive idea

The most important key of this thesis is: how do we use regular infor-mation to limit constraints. Usually, the method we using regular expression is that let strings to match regular expression. But the situation of handling constraints is different from handling strings. The character value of a string is fixed, but the constraints may contains many possible values. Consider the difference, we consider two methods to handle this problem at first. The first method is : At KLEE selects a execution state, we use SMT solver to resolve constraints and get the value which is not contradictory to constraints. If the

value can not adhere to regular expressions, we discard the execution state.

The problem of this method is: The resolved value can not represent con-straints, it is only a possible value about constraints. So if a execution state is discarded, that does not mean the execution state can not satisfy regular expression. In some cases, this method filter more than 99 percentage of execution paths that is not contradictory to user-defined regular expression.

The second method is: We use regular expression to generate fixed value inputs, then convert these inputs to constant constraints. For every input, we create execution state and add corresponding constant constraints to ex-ecution state. This method also has some problems:

• The number of possible cases which can satisfy regular expression are significant.

• Be the same with the problem of fuzz testing, most of generated values are corresponding the same execution path.

4.2.2 The improved idea

According the analysis above mentioned, the two methods have some critical problems. The main reason is that we use fixed values to replace constraints or regular expression in some procedures, so some information is missing and inaccurate. We consider other methods which does not loss any information. The third method is ”regular expression expansion” method.

The ”expansion” means that expanding all possible value in non-fixed times expression range,then we convert these expanded regular expressions to fixed-length constraints. And we append every constraints to forked states, the advantage of this method is needless to use SMT solver (There is no failed re-sult by add constraints because the initial execution state has no constraint.) The disadvantage is that the number of possible expanding states may be

The fourth method is ”checking every time” method. When KLEE’s searcher select a execution state, we compare the constraints to regular ex-pression byte by byte. The advantage is we can append no constraints to states (because of the information of non-fixed times expression can not con-vert to constraints, so if we use third method, there still is some error), the disadvantage is that the number of solver query requests is significant. The number of solver query requests usually are bottleneck of symbolic testing, so this method is not efficient.

Finally, consider the size of standard input buffer usually has some limit, the number of possible expanding states can be reasonably limited, we use third method to implement.

Figure 4-2: Klee Modified Architecture

4.2.3 The Overview of Flowchart

Our method is described as following: First, at the moment of KLEE’s searcher selects a state to continue exploring paths, match the path con-straints to regular expression’s information, see Figure 4-2. If it can be resolved after adding regular expression’s information, still execute. If it can be resolved, then discard this execution path.

Because the smallest unit of regular expression is character and the size of character is byte, we operate adding constraints of regular expression information in byte level. The flow of our method is showed as Figure 4-3.

Figure 4-3: KLEE with Regular Expression Matching Logic

5 Implement

5.1 Data Struct

A regular Expression includes two important type of information. First type of information is contents of comparison, second type of information is the number range of match times. So we define two kinds of structure to record and describe the information of regular expressions.

5.1.1 ByteUnit

Structure ByteUnit consists of the type of contents of comparison. It consists of two attributes:

• type: We only implement following type about contents of regular ex-pressions:

– .: all, all possible character.

– \d: digit, from character ’0’ to character ’9’.

– \w: word, same as [a − zA − Z0 − 9 ].

– \s: space, \t, \n, \f, \r and \v.

– [...]: bracket, all list characters are possible values.

– [ˆ...]: nbracket , all list characters are impossible values.

• contents: A vector of characters. If the type is bracket, nbracket(not including beginning ˆ) or other, this attribute contains the character(s).

We handle the contents of bracket or nbracket type at patch function.

5.1.2 TimesExpr

Structure TimesExpr records the match times information (like {?,?},

?, * and +) about some elements of regular expression. It consists of four

• min: An unsigned integer records the minimum number of match times.

• max: An unsigned integer records the maximum number of match times. If the match times is a fixed value rather than a range, the value of max is zero.

• nextLevelTEs: A vector contains of TimesExpr objects, and these TimesExpr are sequencing by order. Because a regular expression usu-ally contains more than one match times information, and some match times information usually contains other match times information. For example (\d{2,4}){3}, the {3} contains the match times information {2,4} .

• unit: It contains a ByteUnit object.

To explain easily, in regular expression, parentheses represent a Time-sExpr. If a TimesExpr contains no nextLevelTEs, it must contain a ByteUnit object.

在文檔中藉由正規限制式之擬真運算處理符號檔案輸入 (頁 20-29)