be presented. Chapter 3 presents the definitions and the problem statement.
The system architecture and the proposed methodology are presented in Chapter 4.
Experimental results and observations are presented in Chapter 5. Finally, Chapter 6 concludes this thesis.
Chapter 2 Related Works
In this chapter, we introduce the basic terms of diversity indices and three well-known diversity indices, and give an example to illustrate them. Code coverage is introduced as well, and its criteria are illustrated by an example.
2.1 Diversity Index
Overview
A diversity index is a statistic that measures the diversity of a set whose members belong to various types of objects. It was first introduced in ecology [1] to measure the biodiversity of an ecosystem. It can also be applied in other areas, such as in economics to measure the distribution over sectors of economic activity in a region, and in information science to describe the complexity of a set of information. We shall use the terms of ecology to explain diversity indices in the rest of this chapter.
Two basic factors underlie a diversity index: species richness and species evenness [2]. Species richness is simply the number of species present in a system and makes no use of relative abundances. The more species present in a habitat, the richer the habitat. Species evenness describes how evenly the individuals are distributed among the species, that is, the relative abundance of each species.
To give an example, we sample two different fields for wildflowers, as shown in Table 1. Sample 1 consists of 300 daisies, 335 dandelions and 365 buttercups, while sample 2 consists of 20 daisies, 49 dandelions and 931 buttercups. Both samples have the same richness, 3 species, and the same total number of individuals. However, the first sample is more even than the second, because its individuals are distributed more evenly among the three species.
Table 1: Examples of richness and evenness
                  Numbers of individuals
Flower species    Sample 1    Sample 2
Daisy             300         20
Dandelion         335         49
Buttercup         365         931
Total             1000        1000
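The two factors can be made concrete with a minimal C++ sketch. The function names `species_richness` and `relative_abundances` are hypothetical and chosen only for illustration; the input is assumed to be a vector of per-species counts such as the columns of Table 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Species richness: the number of species with at least one individual.
std::size_t species_richness(const std::vector<std::uint64_t>& counts) {
    std::size_t richness = 0;
    for (std::uint64_t n : counts) {
        if (n > 0) ++richness;
    }
    return richness;
}

// Relative abundance p_i = n_i / N of each species. Evenness is judged
// by how close these proportions are to one another.
std::vector<double> relative_abundances(const std::vector<std::uint64_t>& counts) {
    std::uint64_t total = 0;  // N, the total number of individuals
    for (std::uint64_t n : counts) total += n;
    assert(total > 0);  // proportions are undefined for an empty sample

    std::vector<double> proportions;
    proportions.reserve(counts.size());
    for (std::uint64_t n : counts) {
        proportions.push_back(static_cast<double>(n) / static_cast<double>(total));
    }
    return proportions;
}
```

For both samples of Table 1 the richness is 3, while the proportions differ: roughly 0.30, 0.34 and 0.37 for sample 1 against 0.02, 0.05 and 0.93 for sample 2.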
Simpson’s index
Simpson’s index [3], in terms of ecology, takes into account both species richness and species evenness. Simpson’s index D represents the probability that two individuals selected at random from the habitat belong to the same species.
The formula for calculating D is
$$D = \frac{\sum_{i=1}^{S} n_i (n_i - 1)}{N (N - 1)},$$
where $N$ represents the total number of individuals of all species, $n_i$ the number of individuals in species $i$, and $S$ the number of species. However, we often use $1 - D$ to present diversity more intuitively. A value of 0 represents no diversity, while a value of 1 represents infinite diversity. The larger the value of $1 - D$, the greater the diversity.
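To give an example, consider the samples in Table 1. We calculate the diversity index $1 - D$ for sample 1 and sample 2 as
$$1 - D_1 = 1 - \frac{300 \cdot 299 + 335 \cdot 334 + 365 \cdot 364}{1000 \cdot 999} \approx 0.665,$$
$$1 - D_2 = 1 - \frac{20 \cdot 19 + 49 \cdot 48 + 931 \cdot 930}{1000 \cdot 999} \approx 0.131.$$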
Therefore, sample 1 is more diverse than sample 2 in the view of Simpson’s index.
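The calculation above can be sketched in C++ as follows. This is a minimal illustration, not the implementation used in this thesis; it assumes the finite-sample form of Simpson’s index (two individuals drawn without replacement), and the names `simpson_index` and `simpson_diversity` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simpson's index D: probability that two individuals drawn without
// replacement belong to the same species. counts[i] is n_i.
double simpson_index(const std::vector<std::uint64_t>& counts) {
    std::uint64_t total = 0;  // N
    for (std::uint64_t n : counts) total += n;
    assert(total > 1);  // D is undefined for fewer than two individuals

    double same_species_pairs = 0.0;  // sum over i of n_i * (n_i - 1)
    for (std::uint64_t n : counts) {
        if (n > 0) {
            same_species_pairs +=
                static_cast<double>(n) * static_cast<double>(n - 1);
        }
    }
    return same_species_pairs /
           (static_cast<double>(total) * static_cast<double>(total - 1));
}

// Diversity form 1 - D: larger values mean more diversity.
double simpson_diversity(const std::vector<std::uint64_t>& counts) {
    return 1.0 - simpson_index(counts);
}
```

Called with the counts of Table 1, `simpson_diversity` gives roughly 0.67 for sample 1 and 0.13 for sample 2. Only one pass over the counts and no logarithms or powers are needed.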
Shannon’s index
The Shannon index H [4][5] also takes into account both species richness and species evenness. It is the information entropy of the distribution, treating species as symbols and their relative population sizes as probabilities.
The formula for calculating H is
$$H = -\sum_{i=1}^{S} p_i \ln p_i,$$
where $p_i$ represents the relative abundance of species $i$, calculated as $p_i = n_i / N$; $n_i$ is the number of individuals in species $i$, $N$ the total number of individuals, and $S$ the number of species.
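To give an example, consider the samples in Table 1. We calculate the diversity index for sample 1 and sample 2 as
$$H_1 = -(0.300 \ln 0.300 + 0.335 \ln 0.335 + 0.365 \ln 0.365) \approx 1.095,$$
$$H_2 = -(0.020 \ln 0.020 + 0.049 \ln 0.049 + 0.931 \ln 0.931) \approx 0.293.$$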
Therefore, sample 1 is more diverse than sample 2 in the view of Shannon’s index.
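The entropy calculation can be sketched in C++ as follows. This is an illustrative sketch under the assumption that the natural logarithm is used, as in the formula above; the name `shannon_index` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Shannon index H = -sum_i p_i ln p_i, with p_i = n_i / N.
// counts[i] is n_i, the number of individuals in species i.
double shannon_index(const std::vector<std::uint64_t>& counts) {
    std::uint64_t total = 0;  // N
    for (std::uint64_t n : counts) total += n;
    assert(total > 0);  // H is undefined for an empty sample

    double h = 0.0;
    for (std::uint64_t n : counts) {
        if (n == 0) continue;  // p ln p tends to 0 as p tends to 0
        const double p = static_cast<double>(n) / static_cast<double>(total);
        h -= p * std::log(p);
    }
    return h;
}
```

With the counts of Table 1, sample 1 scores close to the three-species maximum $\ln 3 \approx 1.099$, while sample 2 scores about 0.29.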
Renyi’s index
Renyi’s index [6][7] is a generalization of Shannon’s index. Renyi’s index of order α is defined as
$$H_\alpha = \frac{1}{1 - \alpha} \ln \sum_{i=1}^{S} p_i^{\alpha},$$
where $\alpha \geq 0$, $\alpha \neq 1$, $p_i$ represents the relative abundance of species $i$, calculated as $p_i = n_i / N$, $n_i$ the number of individuals in species $i$, $N$ the total number of individuals, and $S$ the number of species. Lower values of α, approaching zero, give an index that weights all species increasingly equally, regardless of their relative abundances. As α approaches one, Renyi’s index converges to Shannon’s index. When α = 0, it equals $\ln S$, the maximum possible value of Shannon’s index.
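The behaviour at the two limits of α can be checked numerically with a short C++ sketch. It is an illustration only: it assumes the natural logarithm, so that the index is comparable with Shannon’s index as defined earlier, and the name `renyi_index` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Renyi's index of order alpha:
//   H_alpha = ln(sum_i p_i^alpha) / (1 - alpha), alpha >= 0, alpha != 1.
// counts[i] is n_i; species with zero individuals are skipped.
double renyi_index(const std::vector<std::uint64_t>& counts, double alpha) {
    assert(alpha >= 0.0 && alpha != 1.0);

    std::uint64_t total = 0;  // N
    for (std::uint64_t n : counts) total += n;
    assert(total > 0);

    double power_sum = 0.0;  // sum over i of p_i^alpha
    for (std::uint64_t n : counts) {
        if (n == 0) continue;
        const double p = static_cast<double>(n) / static_cast<double>(total);
        power_sum += std::pow(p, alpha);
    }
    return std::log(power_sum) / (1.0 - alpha);
}
```

For α = 0 every present species contributes 1 to the sum, so the function returns the logarithm of the number of species. For α close to 1, for example 0.999, the result is numerically close to Shannon’s index of the same counts.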
Comparison
All indices above consider both richness and evenness. We further compare them in terms of sample size sensitivity and difficulty of calculation [5]. Table 2 lists the comparison. Simpson’s index is less sensitive to sample size than Shannon’s index and Renyi’s index. In this thesis, the number of packets in the packet traces used in the experiments ranges from 1 to 10,000,000, so we prefer Simpson’s index.
Furthermore, Simpson’s index is simpler to calculate than Shannon’s index and Renyi’s index.
Table 2: Comparison between diversity indices
                          Simpson’s index    Shannon’s index    Renyi’s index
Sample size sensitivity   Low                High               High
Calculation               Simple             Moderate           Moderate
2.2 Code Coverage
Overview
Code coverage [8][9] was first mentioned in “Communications of the ACM” in 1963 by Miller and Maloney. It is a measure used in systematic software testing to describe the degree to which the source code of a program has been tested.
Software testing can be categorized simply into black‐box testing and white‐box testing. Black‐box testing is based on what a system is required to do, while white‐box testing is based on how a system operates. Coverage‐based testing provides a way to quantify the degree of thoroughness of white‐box testing [10].
Coverage criteria
To measure how well the program is executed by a test suite, we can use one or more coverage criteria. There are three basic coverage criteria [11][12]:
A. Function level coverage – The percentage of functions which have been called in a program.
B. Branch level coverage – The percentage of branches of control structures which have been taken in a program.
C. Line level coverage – The percentage of lines which have been executed in a program.
For example, consider the following C++ function.
    int foo(int x, int y) {
        int z = 0;
        if ((x > 0) && (y > 0)) z = x;
        return z;
    }
Assume this function is part of a larger program that is run with some test suite. If the function “foo” is called at least once during execution, function level coverage for “foo” is satisfied. Branch level coverage can be satisfied with test cases that call foo(1, 1), foo(1, 0) and foo(0, 0). All three are necessary: in the first two cases (x > 0) evaluates to true, with (y > 0) evaluating to true in the first and to false in the second, while in the third case (x > 0) evaluates to false and (y > 0) is not evaluated at all. Line level coverage can be satisfied if foo(1, 1) is called, because in this case every line of the function is executed.
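The three test cases can be written as a small C++ driver. This is a sketch for illustration: `run_foo_cases` is a hypothetical helper, not part of any coverage tool, and the expected return values follow directly from the definition of “foo”.

```cpp
#include <cassert>

int foo(int x, int y) {
    int z = 0;
    if ((x > 0) && (y > 0)) z = x;
    return z;
}

// Hypothetical test driver: the three calls together exercise both
// outcomes of (x > 0) and both outcomes of (y > 0).
void run_foo_cases() {
    assert(foo(1, 1) == 1);  // (x > 0) true,  (y > 0) true:  z = x runs
    assert(foo(1, 0) == 0);  // (x > 0) true,  (y > 0) false: z stays 0
    assert(foo(0, 0) == 0);  // (x > 0) false, (y > 0) is short-circuited
}
```

As one possible way to observe the resulting coverage, assuming a GCC toolchain, the program can be compiled with the `--coverage` option and the per-branch counts inspected with `gcov -b` after it has run.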
Moreover, the increasing size and complexity of software systems have led to increasing challenges in evaluating code coverage. Common code coverage tools may have scalability issues with large software. The prioritized coverage approach [13] was proposed to provide capabilities for evaluating code coverage and setting priorities for testing.