
5.2.1. Data-Centric Communication

In the traditional IP network, nodes communicate with each other using fixed IP addresses. The daily operation of the Internet has proven this communication model effective in supporting applications running on static, full-fledged computers. For mobile and resource-limited sensor networks, however, configuring and reconfiguring node addresses in the presence of node dynamics poses a great challenge. This problem was first raised and addressed in one of the pioneering works on sensor networks [51], in which the authors propose the data-centric communication paradigm.

In data-centric communication, digital information is disseminated based on the features, attributes, or content of the information itself, not the addresses of issuers or receivers. In the first data-centric routing mechanism for wireless sensor networks [1], sinks send explicit interest packets to set up routing states at the intermediate nodes. These interest-specific routing states in turn draw the data of interest toward the sinks.

Such a dissemination scheme relies on a well-defined naming system to describe data attributes and sink interests. The corresponding naming system and the filter-based forwarding mechanism are detailed in [55]. The string matching problem, although recognized there as the performance bottleneck, is not addressed.

5.2.2. Publish/Subscribe Systems

In [7][61], the authors design a set of efficient forwarding and routing mechanisms for content-based data dissemination. The notion of content-based communication is essentially the same as data-centric communication. The mechanisms proposed, although descended from the literature on publish/subscribe systems, are applicable to sensor networks. There are two kinds of publish/subscribe systems: channel-based and content-based. In both, multiple users may subscribe to the data of interest. In channel-based systems, users subscribe to a particular channel, and the corresponding data broker pushes data to that channel, from which the subscribing users receive it. In content-based systems, the concept of a channel is refined into rules and interests that travel through the intermediate nodes between brokers and subscribers. If the rules belonging to some subscribers are matched by certain data, the data are forwarded to the indicated output ports; otherwise the node drops the data. Much of the improvement in rule matching concentrates on the management of subscriptions, i.e., the organization of predicates and rules. For instance, index algorithms, also used extensively in database management, are adopted by [56] to manage the subscriptions and speed up the matching process. Other data structures, such as bottom-up selection trees and binary decision diagrams, are proposed by [57] and [58], respectively, to improve the matching performance.

Little work has formally addressed the problem of string matching.

5.2.3. Ad Hoc Publish/Subscribe Systems

The close relationship between publish/subscribe systems and data-centric sensor networks was first formally noted by [62]. The limited energy, computing power, and memory space of a typical wireless sensor node give rise to unique challenges in the design of data forwarding algorithms. The authors of [63] found that the interest diffusion mechanisms used by general publish/subscribe systems are not suitable for resource-limited wireless sensor networks. Energy-efficient routing schemes such as [64][65] are proposed to minimize the interest dissemination overhead. Despite the level of activity in energy-efficient publish/subscribe routing for sensor networks, the problem of efficient forwarding has received little attention.

5.2.4. String Matching Algorithms

String matching problems are well studied in the domains of bioinformatics, AI, and data mining. From the classic KMP matching algorithm [66] to more recent ones, a range of schemes has been proposed. TST [67] is a state-of-the-art algorithm that enables efficient binary search of strings. It is widely used in dictionary lookup and information retrieval.

ST [68] is a data structure for very high speed string matching. To handle dynamics in wireless sensor networks, an algorithm must be able to insert attribute and value strings into the suffix tree efficiently. The online suffix tree construction algorithm proposed by Ukkonen [59] satisfies this requirement. In [60], a memory improvement technique is presented that removes the redundancy inside the internal nodes of suffix trees. With this data management scheme, the space requirement of a static suffix tree can be reduced to fewer than four integers per node on average. In this work, we adopt ST for the string matching involved in the data forwarding process for sensor networks.

5.3. ALGORITHMS

In this section we introduce the four string matching algorithms under comparison. We begin with the straightforward ones, hash and TST, and then detail the ST algorithm and its improvement.

5.3.1. Hash

We use a rotation-based hash function [69] to construct the hash table of attribute and value strings and to search for strings in it. The hash value is calculated from the whole input string. Suppose every character in the string is coded as a number and s[i] represents the ith character in the string. For an n-character string, the hash value, h(n), is defined as follows:

h(n) = (h(n − 1) << d) + s[n − 1]

h(0) = C

C is a pre-defined constant and d is the shift amount. We obtain the hash value of an n-character string by iterating the shift and add operations n times. For a hash table of m slots, h(n) is further reduced modulo m. The final hash value associates the string with its location in the hash table: a string with hash value j is stored in the jth slot.

Multiple strings might be hashed to the same slot. A simple linked list is used to handle the collisions. When a new interest packet is received, the attribute or value string will be inserted into the linked list at the corresponding slot. Similarly, when a data packet is received, the hash function is applied to obtain the hash values of the attribute and value strings. The system then may look up the hash table to see if there’s a matching attribute and value, i.e., a matching predicate.
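The scheme above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this work: the class name `ShiftAddHashTable`, the default values of m, d, and C, and the 32-bit mask that emulates a fixed-width integer are all assumptions, and Python lists stand in for the linked lists used to resolve collisions.

```python
class ShiftAddHashTable:
    """Chained hash table using h(n) = (h(n-1) << d) + s[n-1], h(0) = C."""

    def __init__(self, m=64, d=5, c=0):
        # m, d and c are illustrative defaults (assumptions, not from the text).
        self.m, self.d, self.c = m, d, c
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def hash(self, s):
        h = self.c
        for ch in s:
            # Shift, then add the character code; the mask emulates a
            # 32-bit register (an assumption made for this sketch).
            h = ((h << self.d) + ord(ch)) & 0xFFFFFFFF
        return h % self.m  # reduce to a slot index

    def insert(self, s):
        """Called when an interest packet carries attribute/value string s."""
        chain = self.slots[self.hash(s)]
        if s not in chain:
            chain.append(s)

    def contains(self, s):
        """Called when a data packet carries attribute/value string s."""
        return s in self.slots[self.hash(s)]


table = ShiftAddHashTable(m=8)
table.insert("temperature")
print(table.contains("temperature"), table.contains("humidity"))
```

A lookup costs one pass over the string to compute the hash plus a scan of a single chain, so the chain length governs the worst case.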

5.3.2. Ternary Search Tree (TST)

String search using TST is similar to binary search. In TST, a node contains a character and four pointers. The character is also known as the key for string comparison. Three of the pointers track the descendant nodes cl, cm, and cr. cl leads to substrings that begin with a character alphabetically smaller than the key, cm leads to substrings that begin with a character equal to the key, and cr leads to substrings that begin with a character alphabetically greater than the key. When a match is identified, one follows the fourth pointer to the matched entry.

Figure 13 shows the TST after inserting five words: egg, gas, get, aids, and bad.

Following the principle of binary search, for an incoming string "bad", the search path on the tree is e → a → b → a → d.

The TST search algorithm is detailed in Algorithm 1. To insert a string, one first searches the existing TST. From the branch where the search stops, the remaining substring is attached to extend the TST. For example, let gdp be a string to be inserted into the TST in Figure 13. The search stops at the g node on the right. At this point, the first part of the incoming string, g, has already been matched. To complete the insertion, new nodes representing the remaining substring, dp, are created and attached one by one onto the branch.

Figure 13. An example of a ternary search tree.
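As a rough sketch of the search and insertion just described (not a transcription of Algorithm 1), the following code builds a TST from the five example words and records the keys visited while searching. The names `TSTNode`, `TST`, `lo`, `eq`, `hi`, and `entry` are hypothetical; `lo`, `eq`, and `hi` play the roles of cl, cm, and cr, and `entry` stands in for the fourth pointer.

```python
class TSTNode:
    def __init__(self, key):
        self.key = key      # character used for comparison
        self.lo = None      # cl: next character smaller than key
        self.eq = None      # cm: next character equal to key
        self.hi = None      # cr: next character greater than key
        self.entry = None   # 4th pointer: matched entry, if a word ends here


class TST:
    def __init__(self):
        self.root = None

    def insert(self, word, entry):
        if not word:
            raise ValueError("cannot insert an empty string")
        self.root = self._insert(self.root, word, 0, entry)

    def _insert(self, node, word, i, entry):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)  # extend the branch where the search stopped
        if ch < node.key:
            node.lo = self._insert(node.lo, word, i, entry)
        elif ch > node.key:
            node.hi = self._insert(node.hi, word, i, entry)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, entry)
        else:
            node.entry = entry
        return node

    def search(self, word):
        """Return (entry or None, list of keys visited)."""
        path, node, i = [], self.root, 0
        while node is not None and i < len(word):
            path.append(node.key)
            ch = word[i]
            if ch < node.key:
                node = node.lo
            elif ch > node.key:
                node = node.hi
            elif i + 1 == len(word):
                return node.entry, path
            else:
                node, i = node.eq, i + 1
        return None, path


tst = TST()
for index, word in enumerate(["egg", "gas", "get", "aids", "bad"]):
    tst.insert(word, index)
entry, path = tst.search("bad")
print(entry, " -> ".join(path))
```

With the words inserted in the order listed, the search for bad visits e, a, b, a, d, which agrees with the path given in the text.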

5.3.3. Suffix Tree (ST)

The concept of a suffix tree is best explained through a more intuitive data structure called a suffix trie. A suffix trie for a string S is a joint tree of its suffixes S1, S2, ..., S|S|. Let Si denote the suffix of S starting from the ith character. For instance, in Figure 14(a), there are 8 suffixes for the string bbabbaab, and bbaab is the fourth suffix. Once we have the suffixes, we can build a suffix trie for the original string. Figure 14(b) shows the suffix trie of bbabbaab. The right (red dashed) edge accounts for the character b and the left (black) one for a. The number on each leaf indicates which suffix of the string in the interest table the incoming string matches. Given the original string and the suffixes, it is trivial to extend the algorithm to substring matching: any substring of bbabbaab must be a prefix of one of its suffixes. A naive method to create a suffix trie is to insert the suffixes into the trie iteratively, from the longest to the shortest. The construction cost is O(|S|^2) steps, but any substring search finishes in O(|P|) steps for an input string P.
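The naive construction and the substring search can be illustrated with the sketch below, which uses nested dictionaries as trie nodes. The function names and the `LEAF` marker key are assumptions made for this example; the marker stores the suffix number at the node where that suffix ends.

```python
LEAF = "leaf"  # multi-character key, so it cannot collide with a one-character edge


def build_suffix_trie(s):
    """Insert suffixes S1..S|S| from longest to shortest: O(|S|^2) steps."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})  # one edge per character
        node[LEAF] = i + 1  # 1-based suffix number
    return root


def walk(trie, p):
    """Follow p from the root; return the node reached, or None."""
    node = trie
    for ch in p:
        node = node.get(ch)
        if node is None:
            return None
    return node


def contains_substring(trie, p):
    """Any substring is a prefix of some suffix: O(|P|) steps."""
    return walk(trie, p) is not None


def matched_suffix(trie, p):
    """Return i if p equals suffix Si, else None."""
    node = walk(trie, p)
    return None if node is None else node.get(LEAF)


trie = build_suffix_trie("bbabbaab")
print(contains_substring(trie, "abba"), contains_substring(trie, "aba"))
print(matched_suffix(trie, "bbaab"))
```

For bbabbaab, abba is found while aba is not, and bbaab is reported as suffix 4, consistent with Figure 14(a).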

ST retains this strength because it is essentially a compact representation of the suffix trie. In a suffix trie, each edge carries one character. ST is more compact in that an edge may carry a sequence of characters. A suffix trie can be transformed into an ST, and vice versa. Figure 14(c) is the ST corresponding to Figure 14(b). The pair of numbers on each edge denotes the beginning and end indices in the original string. For example, [2,3] indicates the substring from the 2nd character to the 3rd, i.e., ba in the example. The numbers on the leaves represent the matched suffix. The curly (green) edges are traversed when none of the children matches. In this case, we can declare that the exact match fails.

As for insertions, we adopt Ukkonen's online construction [59]. With it, one can insert a new string T into ST by gradually adding the characters T1, T2, ..., T|T| to the data structure without breaking any property of the original tree. Taking advantage of the suffix edges, each insertion can be completed in O(|T|) time. This fast construction allows us to build a suffix tree in linear time.

We apply ST to the attribute and value matching for sensor data forwarding. To enable exact string matching, each suffix tree is initialized with a special character $, which is assumed never to appear in any input string. At the initialization stage, the ST string, T, begins as an empty string followed by the special character $. In each insertion, an input string Ti is extended by appending one $ and is concatenated to T; the string Ti$ is then added to the ST structure. After k insertion operations, T equals the string $T1$T2$...$Tk−1$Tk$. An exact search for a string P is achieved by searching for the string $P$ in T. Other string operations such as prefix and suffix matching are achieved by using $P or P$ as input, respectively. We also apply the memory reduction technique proposed in [60] to improve the scalability of ST in space.
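The effect of the $ delimiter can be demonstrated without a suffix tree. The sketch below keeps the concatenated string T and uses Python's built-in substring test in place of the ST search, so it shows only how the queries $P$, $P, and P$ are formed, not the time bounds of ST. The class name `SentinelStore` and its method names are hypothetical.

```python
SENTINEL = "$"  # assumed never to occur in an input string


class SentinelStore:
    def __init__(self):
        self.text = SENTINEL  # T starts as the empty string followed by $

    def insert(self, s):
        if SENTINEL in s:
            raise ValueError("input strings must not contain the sentinel")
        self.text += s + SENTINEL  # append Ti$

    # Each query below is a plain substring search on T; a suffix tree
    # would answer the same question in time proportional to the query.
    def exact(self, p):
        return SENTINEL + p + SENTINEL in self.text

    def prefix(self, p):
        return SENTINEL + p in self.text

    def suffix(self, p):
        return p + SENTINEL in self.text

    def substring(self, p):
        return p in self.text


store = SentinelStore()
for word in ["temperature", "humidity"]:
    store.insert(word)
print(store.text)
print(store.exact("humidity"), store.exact("temp"), store.prefix("temp"))
```

Because every stored string is bracketed by $ on both sides, a query that includes both delimiters can match only a whole stored string.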

Figure 14. Illustration of the relationship between suffix trie and tree

5.3.4. Suffix Tree with Space Improvement (ST+)

The drawback of ST is its high memory consumption. We conducted preliminary experiments on different sensor-network-related word sets to examine the space efficiency of ST. The results in Figure 15 indicate that only 10% of the nodes have more than 5 children, regardless of the size or the characteristics of the word set.

We exploit this property and present a memory optimization scheme for ST, namely ST+. The main strategy concentrates on optimizing the memory usage of internal tree nodes. This follows from the observation that the larger the training/table word set is, the more internal nodes there will be. Therefore, the memory consumption can be effectively reduced.

Figure 15. ST children number distribution over different word sets.

As described in 5.3.3, each node maintains a direct mapping of (NextCharacter, NextNode) pairs. This is memory consuming. To mitigate this effect, we use two types of internal nodes, M-type and H-type. An M-type node is the same as the internal nodes described in 5.3.3. An H-type node uses another hash function to map its child nodes to a smaller hash table. The difference between ST and ST+ is that ST+ uses M-type nodes only when a node is frequently traversed, such as for the attribute strings, or when the number of its child nodes exceeds a certain threshold. Using these two criteria, we classify the internal nodes in ST, and then keep the mapping in the critical nodes direct for speed and the mapping in the other nodes indirect, i.e., through another hash table, for space efficiency.
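The two node types might look like the following sketch. It is an assumption-laden illustration rather than the ST+ implementation: the alphabet size, the number of hash buckets, the threshold of 5 children, and the names `MNode`, `HNode`, and `make_node` are chosen only for this example, and input characters are assumed to be ASCII.

```python
ALPHABET_SIZE = 128   # assumed: ASCII input only
H_BUCKETS = 4         # assumed size of the smaller per-node hash table
CHILD_THRESHOLD = 5   # assumed threshold on the number of children


class MNode:
    """Direct mapping: one slot per possible next character (fast, large)."""

    def __init__(self):
        self.slots = [None] * ALPHABET_SIZE

    def get(self, ch):
        return self.slots[ord(ch)]

    def put(self, ch, child):
        self.slots[ord(ch)] = child


class HNode:
    """Indirect mapping: children hashed into a few buckets (compact)."""

    def __init__(self):
        self.buckets = [[] for _ in range(H_BUCKETS)]

    def get(self, ch):
        for key, child in self.buckets[ord(ch) % H_BUCKETS]:
            if key == ch:
                return child
        return None

    def put(self, ch, child):
        bucket = self.buckets[ord(ch) % H_BUCKETS]
        for i, (key, _) in enumerate(bucket):
            if key == ch:
                bucket[i] = (ch, child)  # replace an existing mapping
                return
        bucket.append((ch, child))


def make_node(num_children, frequently_traversed):
    """Apply the two criteria: hot nodes or wide nodes stay M-type."""
    if frequently_traversed or num_children > CHILD_THRESHOLD:
        return MNode()
    return HNode()


node = make_node(num_children=2, frequently_traversed=False)
node.put("a", "child-a")
node.put("e", "child-e")  # 'a' and 'e' share a bucket when H_BUCKETS is 4
print(type(node).__name__, node.get("a"), node.get("e"), node.get("z"))
```

An M-type node reserves a slot for every possible character, while an H-type node stores only the children that exist, trading a short bucket scan for the space saved.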

Table 3. A comparison of theoretical attributes of matching algorithms.

Table 3 provides a concise overview of the theoretical average search time, insertion time, and memory space requirement of the algorithms under comparison. We elaborate on the results presented in the table in this subsection. The string-related terminology used in this section and the rest of the paper is listed below:

• S: The training word set, consisting of N strings S1 to SN; the total word length of S is denoted by |S|.

• P: The test string, whose length is |P|.

Meanwhile, the operations we would use for fast forwarding are defined as follows:

• Search for an exact match through the data structure (exact match): For an input string P, search all the strings in the training set S to see if there is a string P* that is exactly the same as P. If such a string exists, return its index; otherwise, report a search failure.

• Search for a prefix/suffix/substring match through the data structure: For an input string P, search all the strings in the training set S to see if there is a set of strings {P'} such that every string in {P'} has at least one prefix/suffix/substring that exactly matches P.

• Insertion: Insert a string P into the data structure.

• Deletion: Delete a given string P from the data structure.

In our implementation, prefix, suffix, and substring matching all apply the algorithms listed in [52]. Consequently, the performance of these operations is analogous to that of exact match.
