ReFSM: Reverse Engineering from Protocol Packet Traces to Test Generation by Extended Finite State Machines


Journal of Network and Computer Applications 171 (2020) 102819

Available online 8 September 2020


Ying-Dar Lin^a, Yu-Kuen Lai^b,∗, Quan Tien Bui^a, Yuan-Cheng Lai^c

^a Department of Computer Science, National Chiao Tung University, Hsin-Chu, Taiwan

^b Department of Electrical Engineering, Chung-Yuan Christian University, Chung-Li, Taiwan

^c Department of Information Management, National Taiwan University of Science and Technology, Taipei, Taiwan

ARTICLE INFO

Keywords: EFSM inference; Protocol reverse engineering; Protocol semantic deduction

ABSTRACT

Protocol reverse engineering helps to automatically obtain the specifications of protocols, which are useful for network management, network security systems and test case generation tools. To achieve better accuracy, these kinds of applications require good models that can capture not only the order of exchanging messages (control flow aspect) but also the data being transmitted (data flow aspect). However, current techniques only focus on inferring the control flow represented as a Finite State Machine (FSM), without interpreting the data flow. The Extended Finite State Machine (EFSM), embedding memory in the states and data guards in the FSM transitions, is a method commonly used to represent the data flow. In this work, we propose ReFSM, a novel approach to infer the EFSMs of protocols from only network packet traces. The proposed method is evaluated by using datasets of real-world network traffic traces of four protocols: FTP, SMTP, BitTorrent and PPLive. Based on the results, the coverage, accuracy scores of correctness and behavior of inferred models are always higher than 90%. The precision and recall values of message type identification are, at least, well above 94% and 96%, respectively. The inferred EFSMs are close to the correct model derived from protocol specification.

1. Introduction

A detailed understanding of protocol specifications is not only helpful for the operation of network intrusion detection systems but also essential for the development of protocol fuzzers and test case generation tools. In the testing domain, smart fuzzers are one of the most useful tools for testing the robustness of the implementation of network security systems. Smart fuzzers need to know how to generate the right messages in the right states (Bossert et al., 2014). Furthermore, based on behavioral models, a set of conformance test cases aimed at verification can also be generated by automated test case generation tools (Tappler et al., 2017; Dahbura et al., 1990).

However, obtaining protocol specifications is a tedious and time-consuming task. For instance, it took more than 12 years to complete the reverse engineering of Microsoft's Server Message Block (SMB) protocol (Cui et al., 2007). Even for open protocols such as FTP, HTTP, and MQTT, it takes much time and effort to carefully analyze and manually translate the open source documents to formulate a behavioral model. In addition to such known protocols, more than 40% of Internet traffic belongs to unknown application protocols (Wang et al., 2011), and specifications are not available for those proprietary protocols. A considerable number of them are private protocols that enterprises tend to conceal, while others are malware and botnets, which are hidden by attackers. Therefore, automatic protocol reverse engineering has recently been proposed to carry out such specification inference.

∗ Corresponding author.

E-mail address: ylai@cnsrl.cycu.edu.tw (Y.-K. Lai).

1.1. Motivation

As indicated in the PI Project (Beddoe, 2018), the methods of protocol reverse engineering can be divided into two types: execution trace (application inference) and network trace (network inference) (Sija et al., 2018; Kleber et al., 2019; Duchene et al., 2018). Although application inference can formulate a behavioral model very effectively, it is difficult to carry out as a result of the unavailability of specifications and source codes (Kleber et al., 2019).

The network inference method relies on protocol traces to actively reconstruct the behavioral model and infer protocol message formats. Intrusion detection systems typically rely on a parser to perform deep packet inspection based on protocol specifications (Paxson, 1999; Amiri et al., 2011). The operation of network honeypots, which are used to carry out malware analysis, requires protocol models and message formats so that interaction with the attacking endpoint becomes possible (Wang et al., 2012). Vulnerability discovery tools also leverage the protocol behavioral models, such as finite state machines (FSMs), to produce illegitimate and unexpected patterns (Bossert et al., 2014). For example, in order to reveal potential vulnerabilities of a targeted application, the knowledge of a wrong sequence number, acknowledgment number, and invalid transitions is essential.

https://doi.org/10.1016/j.jnca.2020.102819

Received 13 February 2020; Received in revised form 31 July 2020; Accepted 22 August 2020

Most of the time it is relatively easy to obtain the network traffic traces for protocol reverse engineering analysis from an Internet Service Provider. However, in behavioral model reconstruction, current implementations tend to focus only on capturing the control flow aspects as a Finite State Machine (FSM), without reference to the data flow aspects (Duchene et al., 2018). Moreover, the message format inference result, enforced by the message type identification, is not accurate in either the keyword analysis approach or the distance-based methods adopted in the construction of the FSM.

To improve the FSM performance, a behavioral model termed the Extended Finite State Machine (EFSM), which consists of both data flow and control flow information, is proposed in this paper. An EFSM, comprising memories in states and data guards annotated in conditional transitions, can represent more specific and correct behaviors with a significant reduction of states compared to those of the FSM (Lorenzoli et al., 2006). EFSM inference is a promising solution to a limitation of traditional FSMs, whose exhibited behavior depends on an internal state alone (Foster et al., 2018). In other words, with the help of data guards and memories, an EFSM makes the transition conditional, as data checking is required before transitions are allowed (Petrenko et al., 2004).

1.2. Contributions

In this paper, we present the pioneering work on methodologies and system implementation that employ network traces as the input to infer the EFSM of given protocols. The processing flow chart of the proposed ReFSM, as shown in Fig. 1, is composed of four main modules: Data Pre-processing, Message Type Identification, FSM Reconstruction, and Semantic Deduction.

The Data Pre-processing module takes the raw packet data, reassembles it, and extracts network sessions. Abnormal message removal and missing message handling are part of the process. Based on the hybrid methodologies of Apriori and k-means, the message type identification module is responsible for clustering the messages from the data of the network sessions. The FSM reconstruction module infers the states and transitions of the finite state machine. It further minimizes the number of states by using the K-Tail merging methodology. The inter-message and intra-message dependencies are significant to the inference of the behavioral communication model. Therefore, the semantic deduction module is designed to detect the dependencies of these messages and find the data guards and memories.

The main contributions of this work can be summarized as follows.

• ReFSM, a novel method to infer the EFSMs of network protocols, is proposed by considering not only the aspects of the control flow but also the data flow information from network traffic traces.

• A hybrid message type identification method consisting of keyword analysis and distance-based methods is proposed to overcome the limitations of each method in isolation and enhance the accuracy of the inferred FSM.

• The methodologies of semantic deduction and the correlation detection technique are presented to infer inter-message and intra-message dependencies.

• This paper applies K-Tail state merging to minimize the number of states, and Daikon and the Pearson product-moment correlation for dependency analysis.

• The use of data guard and memory is demonstrated with the accuracy-enhanced FSM for the proposed EFSM construction.

The structure of this paper is as follows. Section 2 gives a general background to the methodologies of protocol reverse engineering. The proposed methodologies and implementations concerning data pre-processing, message type identification, FSM construction, and semantic deduction are described in Section 3 along with the problems addressed in this paper. In particular, Section 3.5 presents a detailed discussion of the major steps in the semantic deduction module, which produces the EFSM. Section 4 discusses the setup of overall system testing and the evaluation strategy. A detailed discussion of complexity comparisons is provided in Section 5. Section 6 presents the experiment results and discussions related to performance and complexity. Finally, in Section 7, we summarize the work presented and make suggestions regarding future work.

2. Background and related work

The ultimate goal of network-trace-based protocol reverse engineering is to determine both message formats and behavioral models of communication protocols by trace analysis. Therefore, methodologies generally proposed can be categorized into message type identification, message format inference, semantic deduction, and behavior model construction (Kleber et al., 2019; Sija et al., 2018).

The ScriptGen tool (Leita et al., 2005) relies on the PI project (Beddoe, 2018), which uses bioinformatics techniques to infer message format. PI uses the Needleman and Wunsch (N&W) (Needleman and Wunsch, 1970) sequence alignment algorithm not only to calculate a distance score for message clustering (message type identification) but also to identify common parts of messages in the same cluster in order to expose the message format. This concept is further developed in ScriptGen and extended to reconstruct deterministic finite state machines to generate a set of scripts for a honeypot, which can mimic a protocol to interact with attackers. Based on the same approach, NetJob (Bossert et al., 2014) modifies N&W and the Unweighted Pairwise Groups with Arithmetic Averaging (UPGMA) hierarchical clustering algorithm (Sokal and Michener, 1958) with the help of contextual information to enhance the accuracy of message type identification. To this end, the authors developed a framework that allows actively collecting traces with context information.

The main idea proposed by Kleber et al. (2020) is to use vector distance on unequally-sized feature vectors by combining two popular methodologies, Hirschberg alignment and the DBSCAN clustering algorithm. However, Kleber et al. only focused on message type identification, which is just one part of the whole protocol-model reverse engineering. The system output is a set of message types of the protocol. These authors are not concerned with the control and data flow of a protocol. Moreover, only stateless protocols are evaluated in the system. In another approach, ReverX (Antunes et al., 2011) and AutoReEngine (Luo and Yu, 2013) leverage keyword analysis to rebuild a Finite State Machine. While ReverX constructs a graph by a generalization heuristic and then reduces it by conventional FSM minimization, AutoReEngine applies a popular data mining technique, Apriori, to find the most frequent strings of bytes with keyword analysis for FSM construction. Without further minimization, the AutoReEngine model often contains a large number of states.

In contrast, Veritas (Wang et al., 2011) and PRISMA (Krueger et al., 2012) only focus on the construction of a behavioral model of protocols. The reconstructed behavioral model of Veritas is a probabilistic protocol state machine, which is a form of non-deterministic FSM with the probability of transitions. In message type identification, Veritas also uses PAM for clustering with the Jaccard index (Jaccard, 1912) as a distance score. Similarly, PRISMA itself defines a distance metric following an embedding concept in natural language processing. The messages are split into tokens by predefined delimiters (textual protocol) or fixed-size delimiters (binary protocol). They are then further mapped to a vector space to directly calculate the distance score by applying a Euclidean metric.


Fig. 1. The system architecture of the proposed ReFSM. The Data Pre-processing module is designed to clean, reassemble and extract sessions. The Message Type Identification module is responsible for clustering the incoming messages. The FSM Reconstruction module is used to infer the states and transitions of the finite state machine. The main tasks of the Semantic Deduction module are to find the data guards and memories.

The MINT (Walkinshaw et al., 2016) technique proposed by Walkinshaw et al. initiates a Prefix Tree Acceptor (PTA) from execution traces. The PTA is then minimized and generalized by the Evidence-Driven State Merging algorithm to obtain the control part of the EFSM. The state pair with the highest scores of the transaction in the outgoing paths of the same label is likely to be merged. To infer the data guards, the authors proposed the use of a broad family of data classifier algorithms in the machine learning domain.

Lo et al. (2012) present an empirical setup that compares four model-inference techniques from software execution traces: K-Tail (Biermann and Feldman, 1972), kBehavior (Mariani et al., 2011), GK-Tail (Mariani et al., 2017) and KLFA (Mariani and Pastore, 2008). Similarly, the work of Krka et al. (2014) provides an empirical study that compares the quality of the inferred FSM model of four approaches: traces only, invariant only, invariant-enhanced-traces and trace-enhanced-invariants. The results highlight the benefit of combining the trace-based approaches and invariant-based approaches.

The methodology proposed by Goo et al. (2019) leverages another data mining technique called Continuous Sequential Pattern (CSP) to extract the probabilistic FSM of plain-text protocols. A hierarchical and recursive CSP is used throughout the method to infer the field format, the message types, and also the flow patterns that form the FSM. The CSP algorithm, modified from the Generalized Sequential Pattern (GSP) (Srikant and Agrawal, 1996), only finds the continuous sequences with the restricted orders of items. This constraint on the elements in sequences helps to find the keyword (static field) of the message more effectively. However, the protocols evaluated (HTTP and DNS) are stateless, not stateful. The authors do not address the data flow and infer the FSM from the control flow only. Moreover, the proposed method does not perform any state merging to reduce the number of states and transitions, so the size of the FSM may be large.

Kleber et al. (2019), Sija et al. (2018) and Narayan et al. (2015) provided extensive surveys on these significant subjects of protocol reverse engineering. In these papers, the authors compare the related works in this field and present a detailed analysis of the technologies used in each step: pre-processing, message type identification, message format inference, and protocol model reconstruction.

Table 1 presents the summary of all these selected works in protocol reverse engineering from network traffic traces.

2.1. Extended finite state machine

A good behavioral model of a protocol should capture not only the order of exchanging messages (control flow aspect) but also the constraints on the data being transmitted (data flow aspect). One of the most widely used models is the Extended Finite State Machine, which enhances Finite State Machines by embedding memory in the states and data guards in the transitions (Lorenzoli et al., 2006; Petrenko et al., 2004; Walkinshaw et al., 2016). An Extended Finite State Machine is typically defined as a 6-tuple EFSM = (𝑆, 𝐼, 𝑂, 𝜎, 𝛿, 𝑀), where 𝑆 is the finite set of states, 𝐼 is the set of inputs, 𝑂 is the set of outputs, and 𝑀 is the set of memories attached to states. A set of Boolean expression functions 𝛿 ∶ 𝐼 × 𝑀 → {𝑡𝑟𝑢𝑒, 𝑓𝑎𝑙𝑠𝑒}, defined in Eq. (3), is termed the data guards and represents the data constraint validation before state transitions. The symbol 𝜎 ∶ 𝑆 × 𝐼 × 𝑀 → 𝑆 × 𝑂 is the set of transitions, which specifies that at state 𝑠 ∈ 𝑆 with corresponding memory 𝑚 ∈ 𝑀, if the machine accepts the input 𝑖 ∈ 𝐼, it enters state 𝑠′ ∈ 𝑆 and produces output 𝑜 ∈ 𝑂.

Fig. 2. Example of the partial EFSM and FSM state diagrams for the TCP protocol. The EFSM is capable of checking TCP sequence numbers in the transactions. In the EFSM, the sequence number of the previous message is stored in the memory of the syn_sent state, and the data guard of this state compares the sequence number of incoming messages with the memory before performing the transition.

Taking the TCP handshake as an example, an EFSM uses a data guard to check the received acknowledgment number against the sent sequence number held in memory. Fig. 2 shows an example of EFSM and FSM as a part of TCP operation. After sending the syn message, the client machine enters the syn_sent state, writes the sent sequence number to memory, and waits for a response. When the syn_ack message is received, the client checks whether the received acknowledgment number is equal to the value in memory plus one before the transition to the established state is allowed.
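The handshake example can be sketched in a few lines of Python. This is a minimal illustration only, not the ReFSM implementation: the `EFSM` class, the `make_client` helper, the `closed` start state and the dictionary-based message layout are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Memory = Dict[str, int]
Guard = Callable[[dict, Memory], bool]    # data guard: (message, memory) -> bool
Update = Callable[[dict, Memory], None]   # memory update applied on a transition


@dataclass
class EFSM:
    """Hypothetical EFSM: states, memory, and guarded transitions."""
    state: str
    memory: Memory = field(default_factory=dict)
    # (state, input type) -> (guard, memory update, next state, output)
    transitions: Dict[Tuple[str, str], Tuple[Guard, Update, str, str]] = field(
        default_factory=dict
    )

    def step(self, msg_type: str, msg: dict) -> str:
        key = (self.state, msg_type)
        if key not in self.transitions:
            return "reject"                 # no transition for this input
        guard, update, nxt, output = self.transitions[key]
        if not guard(msg, self.memory):
            return "reject"                 # data guard failed, state unchanged
        update(msg, self.memory)
        self.state = nxt
        return output


def make_client() -> EFSM:
    """Client side of the handshake, as described in the text (assumed layout)."""
    m = EFSM(state="closed")
    # Sending syn: remember the sent sequence number in memory.
    m.transitions[("closed", "syn")] = (
        lambda msg, mem: True,
        lambda msg, mem: mem.update(seq=msg["seq"]),
        "syn_sent",
        "syn",
    )
    # Receiving syn_ack: the guard requires ack == stored seq + 1.
    m.transitions[("syn_sent", "syn_ack")] = (
        lambda msg, mem: msg["ack"] == mem["seq"] + 1,
        lambda msg, mem: None,
        "established",
        "ack",
    )
    return m
```

A plain FSM would accept any syn_ack in the syn_sent state; here a syn_ack whose acknowledgment number does not match the stored value is rejected and the machine stays in syn_sent.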


Table 1
The comparisons of related works in protocol reverse engineering from network traffic traces. The detailed discussions of complexity comparisons are provided in Section 5. LS: Limited support, S: support.

| Methodology | Message type identification* | Behavioral model reconstruction: models | Behavioral model reconstruction: algorithms | Message format inference: algorithms | Message format inference: semantic deduction |
|---|---|---|---|---|---|
| Proposed (ReFSM) | Hybrid Apriori + K-mean | Extended FSM | K-Tail | Sequence Alignment | S |
| AutoReEngine (Luo and Yu, 2013) | Frequent String Extraction Apriori (k-length constraint) | FSM | Sequence Labeling Apriori | Keyword Group Extraction | N/A |
| CSP (Goo et al., 2019) | Hierarchical CSP Improved Apriori | FSM | Recursive CSP | FieldHunter's semantics inference | S |
| Discover (Cui et al., 2007) | Type based Tokenization Recursive Clustering | N/A | N/A | Type-based Sequence Alignment | LS |
| NetJob (Bossert et al., 2014) | Distance based K-mean (N&W alignment score) | FSM | N/A | Sequence Alignment | LS |
| PRISMA (Krueger et al., 2012) | Distance based K-medoids clustering + Jaccard Index | FSM | Generalization | N/A | N/A |
| ReverX (Antunes et al., 2011) | Keyword analysis PTA + Frequency Labels | FSM | Generalization and Minimization | Graph generalization | N/A |
| ScriptGen (Leita et al., 2005) | Distance based Modified N&W alignment | FSM | Generalization and Minimization | Sequence Alignment | N/A |
| TPRE (Trifil et al., 2009) | Statistical Keyword analysis Variance of the Distribution | FSM | State Splitting | N/A | N/A |
| Veritas (Wang et al., 2011) | Distance based Kolmogorov–Smirnov Statistical testing | Probabilistic FSM | N/A | N/A | N/A |

2.2. Daikon and K-Tail algorithm

EFSM inference of communication protocols from only network traces has not yet been addressed, but several pioneering techniques have been proposed to infer EFSM models of software from software execution. The first method to address this issue is GK-Tail of Lorenzoli et al. (2006) (Lorenzoli et al., 2008). In this work, the authors used the K-Tail algorithm to infer the control part (FSM). The Daikon invariant detection system is adopted to derive the data guards as the extended part of the FSM (Mariani et al., 2017).

The K-Tail algorithm is a model reduction algorithm which helps to simplify complicated models by merging their equivalent states. The basic idea of the K-Tail algorithm (Biermann and Feldman, 1972) is that two states are equivalent if they share the same behavior in the next k transitions. An example of k = 2 is illustrated in Fig. 3. The states 𝑆₂⁰ and 𝑆₂¹ are equivalent because they share the same 2-tails {(𝑖₃∕𝑜₃), (𝑖₄∕𝑜₄)}, which is similar to the pair (𝑆₃⁰, 𝑆₃^𝑢), whose 2-tails are {(𝑖₄∕𝑜₄), (𝑖₅∕𝑜₅)}. The result of this algorithm is the reduced FSM in Fig. 4. This heuristic is effective for reducing non-deterministic FSMs where the standard deterministic FSM minimization algorithm fails. Software reverse engineering methods typically use it to obtain minimized models.
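The k-tail equivalence test can be illustrated with a short Python sketch. The FSM encoding (a dictionary from a state to a list of `(label, next_state)` pairs), the function names, and the convention that a tail may be shorter than k when a path ends early are assumptions made for this example; the toy machine below is hypothetical and is not the one drawn in Fig. 3.

```python
from itertools import combinations


def k_tails(fsm, state, k):
    """Return the set of label sequences of length k leaving `state`
    (shorter when a path ends at a state without outgoing transitions)."""
    if k == 0 or not fsm.get(state):
        return frozenset({()})
    tails = set()
    for label, nxt in fsm[state]:
        for rest in k_tails(fsm, nxt, k - 1):
            tails.add((label,) + rest)
    return frozenset(tails)


def equivalent_pairs(fsm, k):
    """List all pairs of states whose k-tails are identical."""
    states = sorted(fsm)
    return [
        (a, b)
        for a, b in combinations(states, 2)
        if k_tails(fsm, a, k) == k_tails(fsm, b, k)
    ]


# Hypothetical tree-shaped FSM: two branches that end in the same behavior.
toy_fsm = {
    "S0": [("i1/o1", "S1")],
    "S1": [("i2/o2", "S2_0"), ("i6/o6", "S2_1")],
    "S2_0": [("i3/o3", "S3_0")],
    "S3_0": [("i4/o4", "S4_0")],
    "S4_0": [],
    "S2_1": [("i3/o3", "S3_1")],
    "S3_1": [("i4/o4", "S4_1")],
    "S4_1": [],
}
```

With k = 2, `S2_0` and `S2_1` share the single 2-tail (i3/o3, i4/o4) and are therefore reported as equivalent, while `S0` and `S1` are not.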

Daikon (Ernst et al., 2001) is an implementation of dynamic invariant detection. It is a template-based, machine learning technique that can be applied to arbitrary data. It can take the raw execution traces or the values of variables as input and finds the best matching properties (rules) for all observed values of variables. To illustrate the outputs of Daikon, we denote 𝑥, 𝑦 as variables and 𝑎, 𝑏 as constants. Taking all observed values of 𝑥, 𝑦 as input, Daikon reports simple invariants (constraints) with related constants 𝑎, 𝑏. For a single variable, it discovers a constraint that holds over its values. Daikon can produce simple invariants such as being a constant (𝑥 = 𝑎), belonging to an enumerated set of values (𝑥 ∈ {𝑎, 𝑏}), or lying in a range bounded by minimum and maximum values (𝑎 ≤ 𝑥 ≤ 𝑏). For multiple variables, it finds a correlation between the values of variables. Because it determines constraints from only the dataset of observed values, it can be used to infer the data guards of EFSMs.
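The three single-variable templates mentioned above (constant, enumerated set, range) can be mimicked with a toy detector. The sketch below is not Daikon and does not use its interface; `infer_invariant`, `holds` and the `max_enum` cut-off are hypothetical names and choices made only to show how observed values turn into a constraint that can act as a data guard.

```python
def infer_invariant(values, max_enum=3):
    """Pick the most specific template that holds over all observed values.

    Returns ("const", a), ("one_of", (a, b, ...)) or ("range", lo, hi).
    The max_enum cut-off is an arbitrary choice for this illustration.
    """
    distinct = sorted(set(values))
    if not distinct:
        raise ValueError("no observations")
    if len(distinct) == 1:
        return ("const", distinct[0])                 # x = a
    if len(distinct) <= max_enum:
        return ("one_of", tuple(distinct))            # x in {a, b}
    return ("range", distinct[0], distinct[-1])       # a <= x <= b


def holds(invariant, x):
    """Evaluate an inferred invariant as a data guard on a new value."""
    kind = invariant[0]
    if kind == "const":
        return x == invariant[1]
    if kind == "one_of":
        return x in invariant[1]
    return invariant[1] <= x <= invariant[2]
```

For example, a field observed only with the value 21 yields a constant invariant, whereas a field observed with many distinct values yields a range bounded by the observed minimum and maximum.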

2.3. Message type identification

Message type identification, a core component of protocol reverse engineering, has significant effects on the accuracy of final results. It also creates an abstraction of messages by grouping messages that are semantically related. Each group contains many messages which share the same structure pattern and similar purpose. Taking the FTP protocol as an example, the ‘‘USER’’ message type always contains the expression ‘‘USER <email>’’ and is used to authorize the username of systems. Current studies based on network traces often rely on keyword-based or distance-based clustering algorithms to identify protocol message types.

Keyword-based methods mostly leverage the frequently occurring strings, also called keywords (e.g., ‘‘USER’’, ‘‘PASS’’), to distinguish message types. They use statistical tests such as the Kolmogorov–Smirnov test, Statistical t-test, Apriori (Agrawal and Srikant, 1994), and Distribution of Variances (Trifil et al., 2009) to extract distinctive keywords.

ReverX (Antunes et al., 2011) adopted frequency labels, a keyword-based analysis methodology, to construct the Prefix Tree Acceptor (PTA). It assumed the message fields are usually split by predefined delimiters. The authors of TPRE (Trifil et al., 2009) assumed that there is a command/control byte (bit) located in a fixed position to represent the message type. The major cost of the proposed methodology is the computation of the Variance of the Distribution of the Variances (VDV) for every position in messages.

NetJob (Bossert et al., 2014) used the k-means algorithm with the N&W alignment score as the distance metric. In ScriptGen (Leita et al., 2005), the authors used the same methods as NetJob with a modified N&W alignment. PRISMA (Krueger et al., 2012) adopted the k-medoids clustering algorithm with the Jaccard index as the distance metric.

Because keyword-based methods rely on the frequency of a string, they often ignore unique message types that rarely occur (e.g., ‘‘PWD’’, ‘‘QUIT’’). Another drawback is keyword misperception: some terms used in hot topics appear frequently and are easily confused with actual protocol keywords. These limitations lead to the inaccurate identification of message types, as messages are grouped into the wrong clusters.


Fig. 3. The initial FSM obtained from sessions of message sequences. For every input sequence, the process starts from the root and travels down along the tree. A new path is created if there are no existing paths. The states 𝑆₂⁰ and 𝑆₂¹ can be merged because of the shared 2-tails of 𝑖₃∕𝑜₃ and 𝑖₄∕𝑜₄. The pair of states 𝑆₃⁰ and 𝑆₃^𝑢 can also be merged because of the shared 2-tails of 𝑖₄∕𝑜₄ and 𝑖₅∕𝑜₅.

Fig. 4. The new FSM after merging by the K-Tail mechanism with K = 2 from the FSM shown in Fig. 3. The states 𝑆₂⁰ and 𝑆₂¹ are merged as 𝑆₂⁰. The pair of states 𝑆₃⁰ and 𝑆₃^𝑢 is also merged as 𝑆₃⁰.

Distance-based methods apply an unsupervised machine learning algorithm with a similarity metric of messages for message clustering. This approach merges pairs of clusters by their similarity. Among many machine learning techniques, Unweighted Pairwise Groups with Arithmetic Averaging (UPGMA) and Partitioning Around Medoids (PAM) are the most popular. Several metrics have been proposed, such as the Jaccard index (Jaccard, 1912), Needleman and Wunsch (N&W) (Needleman and Wunsch, 1970) or the Longest Common String (LCS). The disadvantage of this approach is that the number of clusters must be known in advance, which is sometimes hard to define. Otherwise, these methods easily produce under-specified or over-specified clusterings in practice.

2.4. Inter and intra message dependencies

The term ‘‘semantic deduction’’ refers to a process that infers inter- and intra-message dependencies that exhibit the semantic meaning of fields in messages. The term inter-message dependency represents the relationship in which a field regulates the property of another field in a different message, such as cookies (HTTP) and sequence numbers (TCP). Intra-message dependency is a correlation between fields within one message, for example, consistency fields such as a checksum or direction fields such as length and offset.
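As a small illustration of an intra-message dependency, a length field should correlate almost perfectly with the size of the payload it describes. The sketch below computes the Pearson product-moment correlation by hand; the function name and the choice to return 0.0 for a constant column are assumptions of this example, and it is not claimed to be the correlation procedure used by ReFSM.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences.

    Returns 0.0 when either sequence is constant (an assumption of this
    sketch, since the coefficient is undefined in that case).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

A candidate length field whose values track the observed payload sizes gives a coefficient close to 1, which suggests a length dependency, while a field that never changes gives no evidence of one.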

Although inter- and intra-message dependencies play an essential role in building the right messages and interactions, only a few existing works can support such inferences. As a result of the massive number of possible dependencies, all approaches eventually leverage heuristics for searching with the support of human interpretation. Finding the most matches in a predefined set of rules is offered by PRISMA (Krueger et al., 2012) and ScriptGen (Leita et al., 2005) for the semantic deduction. However, they require a few manual interpretation steps to establish a set of rules and can only cover a small part of the semantics.

3. ReFSM System Architecture

The ReFSM processing model, consisting of four major modules (data pre-processing, message type identification, FSM construction and semantic deduction), is illustrated in Fig. 1. First, the traffic traces of a particular protocol have to be pre-processed in three steps: message reassembly, session extraction, and cleaning. The message type identification module uses the Apriori keyword analysis to extract protocol keywords and to determine the number of clusters before clustering messages into groups by the k-means algorithm. Each group is considered as a different message type. The results are then fed to the FSM construction module to infer the FSM by using initialization and the K-Tail merging algorithm. After the correct FSM has been determined, the semantic deduction module extracts sub-datasets containing values of message fields in observed messages. Further analysis is performed to search for correlations between fields in the messages. The result thus deduced is used to form the data guards on each transition in the EFSM.


3.1. Problem statement

Given a set of network traces 𝑇𝑟 containing sessions of the target protocol, the objective of our work is to reconstruct an accurate behavioral model of the protocol in the form of an EFSM with six-tuple (𝑆, 𝐼, 𝑂, 𝜎, 𝑀, 𝛥) information. Greater correctness and coverage scores reflect a higher quality of the model.

Table 2 lists the notations used in this work. A packet, denoted as 𝑃𝑘, is the primary element of communication between two entities. The session 𝑆𝑒𝑠 = {𝑃𝑘_𝑖 | 𝑖 = 1, …, 𝑁} is the ordered sequence of packets; it is defined by its five-tuple of source address, source port, destination address, destination port and timestamp. The input to our system is the network traces 𝑇𝑟 = {𝑆𝑒𝑠_𝑗 | 𝑗 = 1, …, 𝐾}, which consist of a set of sessions. We assume that each session only contains information of the target protocol. The sets of states, inputs, outputs, transitions, memories and data guards of the EFSM are denoted by 𝑆, 𝐼, 𝑂, 𝜎, 𝑀 and 𝛥, respectively. 𝐸𝐹𝑆𝑀^{𝑖𝑛𝑓} represents the inferred model reconstructed from the network traces 𝑇𝑟. 𝐸𝐹𝑆𝑀^{𝑡𝑟𝑢𝑒} represents the actual model extracted from the exact specification of the target protocol.

3.2. Data pre-processing module

This module takes traffic traces from TCP dump files as input. At the beginning of this process, messages are parsed to extract sessions based on the 5-tuple packet header fields: source address, source port, destination address, destination port and type of transport layer protocol.

The fragmented messages are then assembled, and the duplicates and re-transmissions are removed. The remaining messages continue to be parsed by ignoring unrelated messages; only the payloads containing the information of the target application protocol are kept for further analysis.
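The session extraction and de-duplication steps can be sketched as follows. The packet representation (a dictionary with `src`, `sport`, `dst`, `dport`, `proto`, `seq` and `payload` keys) and both function names are assumptions of this example; real traces would first be decoded from the capture file, and TCP reassembly is more involved than the duplicate check shown here.

```python
from collections import defaultdict


def session_key(pkt):
    """Direction-independent 5-tuple key, so that requests and responses
    of one connection fall into the same session."""
    a = (pkt["src"], pkt["sport"])
    b = (pkt["dst"], pkt["dport"])
    lo, hi = sorted([a, b])
    return (lo, hi, pkt["proto"])


def extract_sessions(packets):
    """Group payloads by session, skipping empty payloads and packets
    already seen from the same sender with the same sequence number."""
    sessions = defaultdict(list)
    seen = set()
    for pkt in packets:
        if not pkt["payload"]:
            continue                        # e.g. pure acknowledgments
        key = session_key(pkt)
        fingerprint = (key, pkt["src"], pkt["sport"], pkt["seq"])
        if fingerprint in seen:
            continue                        # duplicate or re-transmission
        seen.add(fingerprint)
        sessions[key].append(pkt["payload"])
    return dict(sessions)
```

Each resulting list of payloads is one session in arrival order and can be passed on to message type identification.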

3.3. Message type identification module

Hybrid keyword-analysis and distance-based approaches are used to increase the quality of message type identification.

Based on a modification of AutoReEngine (Luo and Yu, 2013), the Apriori keyword analysis is used to identify the keywords. The algorithm iteratively finds high-frequency, closed sequences of bytes (or strings) with stable position variance by the Apriori method. In each iteration, only closed sequences with a frequency higher than a pre-defined threshold are defined as keywords.

After the keyword extraction process, the keyword series observed in the dataset can be used as the unique format of message types. Each keyword series is a group. The extracted keyword series resolve the limitation of k-means of requiring the number of clusters in advance, because the number of clusters is determined before the k-means algorithm runs.

The distance metric is based on the Jaccard index (Jaccard, 1912), defined as 1 − |𝑎 ∩ 𝑏| ∕ |𝑎 ∪ 𝑏|, where 𝑎, 𝑏 are the character arrays of the messages.
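The distance can be computed directly from this definition. The sketch below assumes that each message is treated as the set of byte values it contains, which is one possible reading of "character array"; the function name and the convention that two empty messages have distance 0 are choices made for this example.

```python
def jaccard_distance(a: bytes, b: bytes) -> float:
    """Jaccard distance 1 - |a ∩ b| / |a ∪ b| over the sets of byte values."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0          # two empty messages are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)
```

Identical messages have distance 0 and messages with no byte in common have distance 1, so the value can be used as the dissimilarity in the clustering step.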

The issues of threshold and keyword misperception are resolved by the iterative k-means clustering based on the distance metric. The k-means clustering algorithm helps to calibrate the keyword extraction in the first step. The undecided message sets are kept to overcome the issue of a missing keyword so that the extraction process can be repeated.

Because the size of the dataset is reduced, the keywords which have a low occurrence in the original dataset can be revealed. For instance, in Fig. 5, the keywords ‘‘QUIT’’ and ‘‘RNFR’’ are recovered in the second iteration. The procedure ends when the undecided set is empty or acceptable.
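The effect of repeating the extraction on a shrinking undecided set can be shown with a deliberately simplified sketch. It replaces the Apriori analysis by a frequency count of the first space-delimited token and omits the k-means calibration entirely, so it only illustrates why rare keywords surface in later rounds; `iterative_keywords`, `min_support` and `max_rounds` are hypothetical names.

```python
from collections import Counter


def iterative_keywords(messages, min_support=0.3, max_rounds=10):
    """Repeat thresholded keyword extraction on the undecided messages.

    A token becomes a keyword when it starts at least `min_support` of the
    messages that are still undecided in the current round.
    """
    undecided = list(messages)
    rounds = []
    for _ in range(max_rounds):
        if not undecided:
            break
        counts = Counter(m.split(" ", 1)[0] for m in undecided)
        found = {tok for tok, c in counts.items() if c / len(undecided) >= min_support}
        if not found:
            break                           # nothing frequent enough remains
        rounds.append(found)
        # Messages explained by a keyword are decided and leave the set.
        undecided = [m for m in undecided if m.split(" ", 1)[0] not in found]
    return rounds, undecided
```

In a dataset dominated by ‘‘USER’’ and ‘‘PASS’’ messages, a single ‘‘QUIT’’ falls below the threshold in the first round but exceeds it once the frequent messages have been removed.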

3.4. FSM construction module

The goal of this module is to infer a conventional FSM model with 4-tuple: (𝑆, 𝐼, 𝑂, 𝜎) before the construction of the EFSM model. The proposed methodology consists of two parts: the initialization and state merging.

As network traces are collected from real-world traffic, some sessions are incomplete, and messages have ordering issues due to packet loss and network delay. Thus, data cleaning techniques such as sequence reordering and missing value inference need to be applied to achieve a better result for the FSM construction.

A request message from a client and the corresponding response from a server can be represented as a pair (𝑖_𝑘, 𝑜_𝑘), where 𝑖_𝑘, 𝑜_𝑘 ∈ 𝑇 and 𝑇 is the set of message types. A session can be represented as a sequence of such pairs 𝑆𝑒𝑠 = (𝑖₁, 𝑜₁), (𝑖₂, 𝑜₂), …, (𝑖_𝑚, 𝑜_𝑚). In the first step of processing, a dataset is fed to the initialization module to build a Prefix Tree Acceptor (PTA), which accepts all session sequences. For every session sequence, the process starts from the root and travels down along the tree. As shown in Fig. 3, a new path is created if no existing path matches.
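The initialization step can be sketched in Python as follows. The dictionary-based representation, the integer state identifiers, and the function name are assumptions chosen for brevity, not the data structures of the ReFSM prototype.

```python
def build_pta(sessions):
    """Build a Prefix Tree Acceptor from sessions.

    sessions: iterable of sessions, each a list of (input_type, output_type) pairs.
    Returns (transitions, num_states), where transitions maps
    (state, (input_type, output_type)) -> next_state and state 0 is the root.
    """
    transitions = {}
    num_states = 1  # the root already exists
    for session in sessions:
        state = 0
        for pair in session:
            key = (state, pair)
            if key not in transitions:
                # No existing path for this pair: create a new branch.
                transitions[key] = num_states
                num_states += 1
            state = transitions[key]
    return transitions, num_states
```

Sessions that share a prefix share the corresponding path, so the tree only grows where a session diverges from all earlier ones.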

Subsequently, the initial FSM is iteratively refined by merging equivalent states based on the K-Tail mechanism. The intuition is that the protocol state machine always exposes the same behaviors in the same state. In other words, the machine produces the same outputs if the same inputs are submitted at a particular state. Because of the deterministic characteristic of protocol state machines, the subsequent states in the K-Tail of two equivalent states can be merged as illustrated in Fig. 4.

The state merging procedure is defined in Algorithm 1. For each state 𝑠 ∈ 𝑆, we define k-tail(s) as the set of next transitions. All pairs of states are compared iteratively to initialize a list of sets consisting of equivalent states. Based on this list, two sets are merged if they share at least one state, until no two sets can be merged. At the end of this step, we obtain a list of sets of equivalent states in which no two sets contain the same state. For every set, a new state is created as the representative for all states in the set. Then, for every original transition, a new transition is added between the two new states that represent the start and end states of that transition in the initial FSM.

Algorithm 1 K-Tail merging algorithm, where Merge(𝑎, 𝑏) = 𝑎 ∪ 𝑏

Input: 𝑆 set of states, 𝑇 set of transitions
Output: 𝑆′, 𝑇′

Initialize 𝑄 = ∅, 𝑆′ = ∅, 𝑇′ = ∅
Foreach pair 𝑠_𝑖, 𝑠_𝑗 ∈ 𝑆 do
    if k-tails(𝑠_𝑖) == k-tails(𝑠_𝑗) then
        𝑄 ← 𝑄 ∪ {{𝑠_𝑖, 𝑠_𝑗}}
STOP_FLAG ← TRUE
While STOP_FLAG do
    STOP_FLAG ← FALSE
    Foreach 𝑞_𝑖, 𝑞_𝑗 ∈ 𝑄 with 𝑖 ≠ 𝑗 do
        if 𝑞_𝑖 ∩ 𝑞_𝑗 ≠ ∅ then
            STOP_FLAG ← TRUE
            𝑄 ← 𝑄 − {𝑞_𝑖}
            𝑄 ← 𝑄 − {𝑞_𝑗}
            𝑄 ← 𝑄 ∪ {Merge(𝑞_𝑖, 𝑞_𝑗)}
Endwhile
Foreach 𝑞_𝑖 ∈ 𝑄 do
    𝑠′_𝑖 ← newState()
    𝑆′ ← 𝑆′ ∪ {𝑠′_𝑖}
Foreach 𝑡_𝑗 ∈ 𝑇 do
    𝑡′_𝑗 ← newTransition()
    Foreach 𝑞_𝑖 ∈ 𝑄 do
        if 𝑡_𝑗.startState ∈ 𝑞_𝑖 then 𝑡′_𝑗.startState ← 𝑠′_𝑖
        if 𝑡_𝑗.endState ∈ 𝑞_𝑖 then 𝑡′_𝑗.endState ← 𝑠′_𝑖
    𝑇′ ← 𝑇′ ∪ {𝑡′_𝑗}
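A compact Python sketch of the same idea is given below. It is an illustration under stated assumptions: the k-tail of a state is taken to be the set of label sequences of length at most `k` leaving that state, transitions are stored as a `(state, label) -> next_state` dictionary, and states are grouped directly by their k-tail. Because k-tail equality is an equivalence relation, this grouping yields the same partition as the repeated merging of overlapping sets in Algorithm 1.

```python
from collections import defaultdict


def k_tail(state, transitions, k):
    """Set of label sequences of length <= k that can be followed from state."""
    tails = {()}
    if k > 0:
        for (src, label), dst in transitions.items():
            if src == state:
                tails |= {(label,) + rest for rest in k_tail(dst, transitions, k - 1)}
    return frozenset(tails)


def merge_k_tails(states, transitions, k=1):
    """Merge states with equal k-tails; return (new_states, new_transitions)."""
    groups = defaultdict(set)
    for s in states:
        groups[k_tail(s, transitions, k)].add(s)
    # One new state id per equivalence class, numbered by smallest member.
    new_id = {}
    for idx, members in enumerate(sorted(groups.values(), key=min)):
        for s in members:
            new_id[s] = idx
    # Re-attach every original transition to the representatives of its ends.
    new_transitions = {(new_id[src], label, new_id[dst])
                       for (src, label), dst in transitions.items()}
    return set(new_id.values()), new_transitions
```

The merged transitions are returned as a set of triples because merging can, in general, make two original transitions coincide.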


Table 2
Table of notations.

| Category | Notation | Description |
|---|---|---|
| Entity | $Pk_i$ | The $i$-th packet. |
| | $Ses = \{Pk_i \mid i = 1, \dots, N\}$ | A session is a sequence of packets. |
| | $Tr = \{Ses_j \mid j = 1, \dots, K\}$ | The traces are a set of sessions. |
| Process | $S = \{S_t\}$ | The finite set of states. |
| | $I = \{I_l\}$ | The finite set of inputs. |
| | $O = \{O_k\}$ | The finite set of outputs. |
| | $M = \{M_{s_t}\}$ | The finite set of memories on states. |
| | $\Delta = \{\Pi \mid \Pi_l : I \times M \to \{0, 1\}\}$ | The data guards. |
| | $\sigma$ | The set of transitions. |
| | $EFSM^{inf}$ | The six-tuple of the inferred EFSM. |
| Evaluation | $SI^{true}$ | The input sequences of the true model. |
| | $SI^{inf}$ | The input sequences of the inferred model. |
| | $EFSM^{true}$ | The six-tuple of the true EFSM. |
| | $EFSM(SI)$ | The sequences of outputs generated by submitting $SI$ to $EFSM$. |
| | $Cov = \frac{EFSM^{inf}(SI^{true})}{\lvert SI^{true} \rvert}$ | The coverage: how many sequences of true input are accepted by the inferred model. Coverage ⇔ Probability($EFSM^{true} \subset EFSM^{inf}$). |
| | $Cor = \frac{EFSM^{true}(SI^{inf})}{\lvert SI^{inf} \rvert}$ | The correctness: how many sequences of inferred input are accepted by the true model. Correctness ⇔ Probability($EFSM^{inf} \subset EFSM^{true}$). |
| | $m$ | The average number of messages in a single cluster. |
| | $n$ | The number of messages. |
| | $l$ | The maximum length (bytes) of a single message. |
| | $d$ | The total number of possible fields. |
| | $t$ | The number of states in the Prefix Tree Acceptor (PTA). |

Fig. 5. A typical example of the processing flow for message type identification with two iterations on the FTP dataset. In the first iteration, the keywords PORT, RETR, PASS and LIST are discovered because of their high occurrence frequency. The rest of the keywords are processed in the second iteration.

3.5. Semantic deduction module

After the 4-tuple (𝑆, 𝐼, 𝑂, 𝜎) finite state machine is constructed, we generate the data guards and memories (𝛿, 𝑀) for each transition to form the 6-tuple (𝑆, 𝐼, 𝑂, 𝜎, 𝛿, 𝑀) extended finite state machine. First, the module identifies the type of each message and decomposes the message into fields. Then it validates the values of each field to decide whether further actions should be performed.

3.5.1. Fields extraction

At this stage, the format of each message type is inferred to obtain fields and the sub-dataset. For every cluster of different message types, the multiple Needleman and Wunsch sequence alignment algorithm (Bossert et al., 2014) is used to extract the format. Because of the high computational complexity of this algorithm, a progressive alignment (Durbin et al., 1998), which uses a pre-built guide tree to decide the alignment order, is adopted. The results are composed of a consensus string, dynamic fields, and static fields. The static fields are similar to the protocol keyword series extracted before. A sub-dataset for each dynamic field is built by obtaining all observed values in the traces and grouping them by the response message type.

Take the FTP protocol as an example. Given the message “PORT 65,240,180,205,56,56”, the FTP server identifies that the request command is “PORT” and the data portion is “65,240,180,205,56,56”. The first four numbers of the data portion are identified as the encoding of the IP address and the remaining two as the port number. The FTP server further checks that each number is within the range of 0 to 255.

Based on this, field extraction is adopted to build the sub-dataset containing all values of the fields, before ReFSM uses Daikon to infer the constraints that hold over all values of a given field. An illustrated example of PORT message alignment is given in Fig. 6. The left-hand side presents the messages in traces and the N&W alignment progress.

The right-hand side shows the sub-dataset of each field in the traffic traces corresponding to the response code of “200”. The sub-datasets of fields are then mined to deduce the protocol semantics.
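The server-side interpretation of the PORT argument described in this subsection can be sketched as follows. The function name and the choice to return `None` for malformed input are assumptions; the port computation follows the RFC 959 convention of two bytes, high byte first.

```python
def parse_port_argument(message: str):
    """Split an FTP PORT request into (ip_address, port), or None if invalid."""
    command, _, argument = message.strip().partition(" ")
    if command.upper() != "PORT":
        return None
    parts = argument.split(",")
    if len(parts) != 6 or not all(p.isascii() and p.isdigit() for p in parts):
        return None
    numbers = [int(p) for p in parts]
    if any(n > 255 for n in numbers):
        # Every field must fit in one byte.
        return None
    ip_address = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]  # high byte first
    return ip_address, port
```

For the message used in the text, the six fields decode to the address 65.240.180.205 and port 56 × 256 + 56 = 14392.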

3.5.2. Deriving data guards

Assume that the message $i = [f_1, f_2, \dots, f_t]$ consists of the fields $f_1$ to $f_t$. The data predicate of a field $f_u$ is defined as

$$Pr_u(v) = \begin{cases} 1 & \text{if } v \in Daikon(f_u) \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

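As mentioned above, the data guard 𝛿 acts as a predicate on the input message. The data guard function 𝛿(𝑖, 𝑚) can then be defined as

$$\delta(i, m) = \delta(f_1, f_2, \dots, f_t, m) \qquad (2)$$

$$\approx \bigwedge_{u=1}^{t} Pr_u(f_u) \;\wedge\; \bigwedge_{u,v=1}^{t} Itr_{uv}(f_u, f_v) \;\wedge\; \bigwedge_{u=1}^{t} Ite_{um}(f_u, m). \qquad (3)$$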

The parameters are listed as follows:

• 𝑃𝑟_𝑢(𝑓_𝑢) → {0, 1} is a data predicate of field 𝑓_𝑢 for data validation. As illustrated in Eq. (1), this function checks whether the value of field 𝑓_𝑢 is valid.

• 𝐼𝑡𝑟_𝑢𝑣(𝑓_𝑢, 𝑓_𝑣) → {0, 1} is a constraint between the values of two fields 𝑓_𝑢 and 𝑓_𝑣, which is also known as the intra-message dependency.

• 𝐼𝑡𝑒_𝑢𝑚(𝑓_𝑢, 𝑚) → {0, 1} is a constraint between the value of field 𝑓_𝑢 and the memories 𝑚 stored from previous messages. It is the inter-message dependency.

For example, the formats of the IP address and port number in the argument of FTP’s PORT command can be expressed as regular expressions and belong to the first type. The checksum and direction fields, such as length and offset, are intra-message dependencies. The cookies in the HTTP protocol are inter-message dependencies.
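The conjunction in Eq. (3) can be illustrated with a short Python sketch. The dictionary layout of the three constraint families, the function name, and the example message format `[length, payload, session_id]` are hypothetical and only serve to show how the three kinds of constraints combine.

```python
def data_guard(fields, memory, pr, itr, ite):
    """Evaluate the guard as the conjunction of Pr, Itr and Ite constraints.

    fields: list of field values of the input message
    memory: dict of values remembered from previous messages
    pr:  {u: predicate(value_u)}                 per-field validity
    itr: {(u, v): predicate(value_u, value_v)}   intra-message dependency
    ite: {u: predicate(value_u, memory)}         inter-message dependency
    """
    return (all(check(fields[u]) for u, check in pr.items())
            and all(check(fields[u], fields[v]) for (u, v), check in itr.items())
            and all(check(fields[u], memory) for u, check in ite.items()))


# Hypothetical guards for a message of the form [length, payload, session_id].
example_pr = {0: lambda length: 0 <= length <= 255}
example_itr = {(0, 1): lambda length, payload: length == len(payload)}
example_ite = {2: lambda sid, mem: sid == mem.get("session_id")}
```

With these example guards, a message passes only if its length field is a valid byte, matches the payload length, and carries the session identifier stored in memory.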

3.5.3. Constraints on fields

ReFSM relies on the Daikon (Ernst et al., 2001) algorithm to generate the data guard of each field. Daikon reduces the broad set of values in a sub-dataset to a simpler invariant that ReFSM uses as a data guard.

Principally, 𝐷𝑎𝑖𝑘𝑜𝑛( ) takes the raw traces or the values of variables 𝑓_𝑢 as input and finds the best matching properties (rules) for all observed values of the variables. As shown in Fig. 7, Daikon worked on the sub-dataset of field 𝑓₁ of the PORT command data and found that the value of 𝑓₁ must be in the range of 0 to 255. Similarly, the arguments of the TYPE command are in the enumerable set {“A”, “I”}. If this constraint is not satisfied, the finite state machine may reply with code 500 instead of code 200.
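The kind of invariant described here can be imitated with a toy Python function. This is not Daikon and does not use its API: it is a stand-in that only knows two invariant templates (a numeric range and a small enumerated set), and the function name and the `max_enum` parameter are assumptions.

```python
def infer_invariant(values, max_enum=5):
    """Return (description, predicate) for an invariant holding on all values.

    Toy stand-in for an invariant detector: integers yield a range invariant,
    other values yield a one-of invariant if few distinct values are seen.
    """
    if all(isinstance(v, int) for v in values):
        lo, hi = min(values), max(values)
        return f"{lo} <= x <= {hi}", lambda x: isinstance(x, int) and lo <= x <= hi
    distinct = set(values)
    if len(distinct) <= max_enum:
        return f"x in {sorted(distinct)}", lambda x: x in distinct
    # No template fits: fall back to the trivially true invariant.
    return "true", lambda x: True
```

Note that the inferred range is only as wide as the observed values, so a sub-dataset that never contains 0 or 255 would produce a tighter bound than the one shown in Fig. 7.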

3.5.4. Inter/intra message dependency

The proposed methodology continues to identify relations between fields in one or two messages for the formulation of the data guard function. The core process leverages the Pearson coefficient, which represents the strength of dependency, on all pairs of attributes in the fields. Potential candidates are then selected, and Daikon is used to infer the relationships.

At first, the dataset of attributes is prepared. It consists of each field’s value, length, and offset in the protocol message, so that different kinds of relationships can be captured. Note that other fields such as IP address and port number are also taken into account. All observed pairs of field attributes are collected iteratively over the sessions in the network traffic traces.

Once this has been carried out, the Pearson product-moment correlation coefficient 𝜌(𝑋, 𝑌) is computed for each observed pair of field attributes (𝑋, 𝑌). The absolute value of the correlation coefficient |𝜌(𝑋, 𝑌)| indicates the strength of the relationship. If the value is close to one, there is generally a linear correlation between 𝑋 and 𝑌. If the value is close to zero, they are mostly independent. For instance, the correlation coefficient is close to one for the two attributes Content-Length (HTTP header) and Length (IP header) because they are linearly dependent. Finally, these dependencies are simplified by applying the Daikon algorithm so that a linear relationship 𝑌 = 𝑎𝑋 + 𝑏 can be derived. The dependencies between fields in the same message are classified as the 𝐼𝑡𝑟 function. The dependencies between fields in two messages are classified as the 𝐼𝑡𝑒 function.
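The correlation step can be sketched in plain Python as shown below. The least-squares fit is a swapped-in technique used here only to illustrate how coefficients of 𝑌 = 𝑎𝑋 + 𝑏 could be obtained; in ReFSM this simplification is delegated to Daikon. The function names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant attribute shows no linear relationship
    return cov / math.sqrt(var_x * var_y)


def linear_fit(xs, ys):
    """Least-squares estimate of (a, b) in Y = aX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("X is constant; the slope is undefined")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x
```

A pair of attributes would be passed on as a dependency candidate only when `abs(pearson(xs, ys))` is close to one; the threshold itself is not specified in the text.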

3.5.5. Transform FSM to EFSM

As mentioned above, we need to infer the 2-tuple of data guard and memory (𝛿, 𝑀) on each transition of the FSM. The data guard function 𝛿, as illustrated in Eq. (3), can be inferred by merging the 𝑃𝑟, 𝐼𝑡𝑟 and 𝐼𝑡𝑒 functions. The values of fields are updated and kept in the memory of each state for future interactions, so that a field’s values in previous messages remain available. However, the size of the memory required is enormous, mainly because of the long sequences of messages that need to be processed. A simple heuristic on the pairs of inter-message dependencies can be adopted to reduce the size of the memory. Initially, the field values of every message are kept in the memory temporarily, and they are eliminated if they are not used in the following messages. If one of the fields in an inter-message dependency pair appears in a later message, the corresponding value in the memory is kept. The memory in the states completes the 6-tuple of the EFSM.

4. Evaluation

This section details the system setup and experimental results of the proposed ReFSM. Its purpose is to evaluate the quality of the inferred EFSM on a given protocol from network traces.

For this evaluation, we use precision as the correctness score and recall as the coverage score. The sequences generated by the true model, $SI^{true}$, and by the inferred model, $SI^{inf}$, are submitted to $EFSM^{inf}$ and $EFSM^{true}$, respectively, to obtain the outputs. The coverage score represents how many sequences of valid input are accepted by the inferred model. The correctness score represents how many sequences of inferred input are accepted by the true model. Acceptance means that the final state is reachable for the given input.

In order to measure the effectiveness of the proposed ReFSM, the pair-wise Precision and Recall (Hatzivassiloglou and McKeown, 1993) values are used to evaluate the effectiveness of the clustering algorithms for message type identification. For the quality of the inferred EFSM, the correctness, coverage score, and behavioral accuracy are adopted for comparison with the conventional FSM. AutoReEngine (Luo and Yu, 2013), which relies on the Apriori algorithm (Agrawal and Srikant, 1994) with a pre-defined threshold and the variances of position to filter out keywords, is selected for the comparative study. The experimental setup, including datasets and metrics, is presented first. Then, the evaluation results and the analysis are described in detail.


Fig. 6. An example of Needleman and Wunsch (N&W) alignment sequences for field extraction. After the alignment procedure, the format of the message type PORT becomes 𝑝𝑜𝑟𝑡 𝑓₁, 𝑓₂, 𝑓₃, 𝑓₄, 𝑓₅, 𝑓₆, where 𝑓_𝑡 represents a variable. The values of each field are used to build the dataset for that field.

Fig. 7. An example of the data predicates derived by Daikon. Daikon reduces the broad set of values on the left-hand side to a simpler invariant on the right-hand side. In this example, Daikon infers the valid value of 𝑓_𝑡 to be in the range of (0,256), and the argument of the TYPE message to be {‘A’, ‘I’}.

4.1. Experimental setup

The prototype of ReFSM is implemented in approximately 6600 lines of Java, based on the jNetPcap, Libpcap and Daikon (Ernst et al., 2001) libraries. The system accepts network capture files in tcpdump format as input. The Daikon inference engine is adopted to derive constraints as data guards on the transitions between states in the EFSM.

Four protocols, FTP, SMTP, BitTorrent, and PPLive, are selected to evaluate the proposed methodology. The datasets of network traces are collected from publicly available and self-captured sources. The real-world FTP traces are from Lawrence Berkeley National Laboratory (Pang and Paxson, 2003). The anonymized network traces consist of more than 210,000 FTP messages in about 1,800 sessions. However, these traces still contain malformed messages, and extra steps to eliminate illegal messages are needed. The FTP traces are merged with networking traces captured from the NCTU university networks. The SMTP and BitTorrent datasets are also captured from the university network.

Unveiling the specifications of proprietary and closed protocols is one of the motivations of this work, so PPLive, one of the most popular P2P video streaming applications, is selected. Table 3 summarizes the datasets used in the evaluation.

4.2. Pair-wise precision and recall

Pair-wise Precision and Recall are among the most frequently used scores to evaluate the effectiveness of clustering. They are used to measure the quality of the proposed message type identification method. We denote $E$ as the set of expected clusters and $R$ as the set of actual clusters obtained as a result, where a pair refers to two messages placed in the same cluster. We calculate the true positives $tp$ by counting the number of pairs shared by $E$ and $R$. The false positives $fp$ represent the number of pairs in $R$ but not in $E$. The false negatives $fn$ represent the number of pairs in $E$ but not in $R$. Then, the recall and precision are defined as $recall = \frac{tp}{fn + tp}$ and $precision = \frac{tp}{fp + tp}$. The expected clustering results are manually derived from the protocol specification. Taking FTP as an example, the 33 commands in RFC 959 (Postel and Reynolds, 1985) can be extracted as 33 clusters of client-side FTP messages.

Similarly, there are 11 message clusters in BitTorrent version 1.0 and 9 clusters in SMTP.

Table 3
Summary of datasets used in the system performance evaluation.

| # | Protocol | Source | Number of Messages | Number of Sessions |
|---|---|---|---|---|
| 1 | FTP | LBNL/Generated | 217,281/5,235 | 1,841/563 |
| 2 | SMTP | Generated | 4,862 | 426 |
| 3 | BitTorrent | Generated | 3,289 | 260 |
| 4 | PPLive | Generated | 8,092 | 281 |
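The pair-wise scores defined in this subsection can be computed with the short Python sketch below. Clusters are assumed to be given as collections of hashable message identifiers, and the function names are illustrative.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(cluster, 2))
    return pairs


def pairwise_precision_recall(expected, actual):
    """Return (precision, recall) of the actual clustering against the expected one."""
    e, r = co_clustered_pairs(expected), co_clustered_pairs(actual)
    tp = len(e & r)  # pairs grouped together in both clusterings
    fp = len(r - e)  # pairs grouped together only in the actual result
    fn = len(e - r)  # pairs grouped together only in the expected result
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, with expected clusters {1, 2, 3}, {4, 5} and actual clusters {1, 2}, {3, 4, 5}, two of the four pairs on each side are shared, so both scores are 0.5.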

4.3. Coverage and correctness scores

In order to compare the performance of two FSMs, previous work (Wang et al., 2011; Trifil et al., 2009; Krueger et al., 2012; Antunes et al., 2011) relied on the correctness and coverage (also known as Precision and Recall, respectively) to compare the inferred FSM with the one derived manually from the protocol specification.

Coverage measures how much of the specification has been inferred by the model. It is calculated as the percentage of actual sessions (generated by the true model) that are accepted by the inferred model. Correctness is the percentage of inferred sessions (accepted by the inferred model) that are also accepted by the true model. Acceptance is defined as the reachability of the last state. In this work, these scores are used to demonstrate that the FSM part of the inferred EFSM is at least equivalent to, or better than, the FSMs in other works on FSM inference.
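Both scores reduce to an acceptance ratio, which the following Python sketch illustrates. The model representation (a deterministic transition dictionary plus a set of final states) and the function names are assumptions.

```python
def accepts(transitions, final_states, session, start=0):
    """True if the session drives the machine from start into a final state."""
    state = start
    for symbol in session:
        state = transitions.get((state, symbol))
        if state is None:  # no transition for this message type
            return False
    return state in final_states


def acceptance_ratio(model, sessions):
    """Fraction of sessions accepted by model = (transitions, final_states)."""
    transitions, final_states = model
    accepted = sum(accepts(transitions, final_states, s) for s in sessions)
    return accepted / len(sessions)
```

Under this sketch, coverage is `acceptance_ratio(inferred_model, true_sessions)` and correctness is `acceptance_ratio(true_model, inferred_sessions)`.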

4.4. Behavioral accuracy

The techniques of correctness and coverage are, however, inadequate to assess the effectiveness of the EFSM, because the data guards are not covered and the outputs are not taken into consideration.

The outputs, which are produced at states and exhibit the protocol’s response to an incoming request, are the only information that can be observed to verify the internal states of machines. They are required for the application of behavior-based models, such as conformance test-case generation and fuzzing tools (Dahbura et al., 1990). An illustrated example with the FTP protocol is given in Fig. 8.

The same input sequences are applied to both models of the FTP protocol, and the output sequences are observed. The generated output code values are {331, 230, 200, 221} while the expected output code values are {331, 230, 500, 221}. The difference lies in the return codes 200 and 500 for a PORT command whose port argument is invalid. The evaluation used in previous works fails to detect this case because both models can still reach their final states.

Therefore, a metric inspired by the protocol conformance test-case generation method and the software specification mining theme (Dallmeier et al., 2012) is adopted here. The set of test cases, consisting of input sequences and the corresponding expected outputs, is prepared in advance. These inputs are then submitted to the inferred model to obtain the outputs, which are compared to the expected outputs. The index of behavioral accuracy, defined as $BA = \frac{\text{Number of accepted outputs}}{\text{Total outputs}}$, is an important quality indicator for both the syntax and the semantics of an inferred model of the protocol.
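A minimal Python sketch of this index is shown below. It assumes that an output counts as accepted when it equals the expected output at the same position of the same test case; the function name and this position-wise matching rule are assumptions.

```python
def behavioral_accuracy(expected_outputs, generated_outputs):
    """Ratio of generated outputs that match the expected outputs.

    Both arguments are lists of output sequences, one sequence per test case.
    """
    total = 0
    accepted = 0
    for expected, generated in zip(expected_outputs, generated_outputs):
        total += len(expected)
        accepted += sum(1 for e, g in zip(expected, generated) if e == g)
    return accepted / total if total else 0.0
```

Applied to the FTP example above, the expected sequence {331, 230, 500, 221} and the generated sequence {331, 230, 200, 221} agree on three of four outputs, which gives a behavioral accuracy of 0.75 for that test case.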

4.5. Test methodology

First, the actual models of the protocols are needed for comparison with the inferred models. For known protocols, we manually derived the accurate models from standard specifications. The model of FTP was extracted from RFC 959. The models of BitTorrent and SMTP were derived from their specifications. We adopted the model from the work of Veritas (Wang et al., 2011) for the test because no PPLive standard specification is available. To obtain an accurate result and eliminate the test set dependencies, k-fold cross-validation (Kohavi, 1995), one of the most popular evaluation methods in machine learning, was applied. The set of traces was randomly split into 𝑘 partitions. Over 𝑘 iterations, one partition was kept to generate test cases and the remaining partitions were used to infer the models. The final score is the average of the scores of the 𝑘 iterations.
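The cross-validation loop can be sketched as follows. The `infer_model` and `score` callables are placeholders for the inference pipeline and for any of the metrics above; they, the function names, and the fixed random seed are assumptions.

```python
import random


def k_fold_splits(traces, k, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation."""
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal partitions
    for i in range(k):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, folds[i]


def cross_validated_score(traces, k, infer_model, score):
    """Average score(model, test) over the k folds."""
    scores = [score(infer_model(train), test)
              for train, test in k_fold_splits(traces, k)]
    return sum(scores) / len(scores)
```

Each trace is used for testing exactly once, so the averaged score does not depend on a single choice of test set.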

To calculate the coverage score, we needed a set of positive sessions, i.e., sequences of inputs accepted by the actual model. For example, a session that consists of the five FTP messages “USER anonymous,” “PASS <password>,” “PORT 140,113,207,14,88,90,” “RETR /pub/info2017” and “QUIT” (denoted as the sequence {USER, PASS, PORT, RETR, QUIT}) was accepted because the model reached the final state. These sessions were submitted to the inferred models to count the number of accepted sessions.

To compute the correctness score, both the positive sessions and the negative ones rejected by the actual model were necessary. Thus, we had to generate negative sessions beforehand, i.e., sessions that might still be accepted by our model but are rejected by the actual model. For this, we randomly selected and mutated positive sessions, similar to the methods presented in the work by Antunes et al. (2011).

Principally, mutations of accepted sessions, such as swapping, deleting and adding messages, are repeatedly performed to generate rejected sessions. For instance, the session {USER, PASS, PORT, RETR, QUIT} is accepted, but the sessions {PASS, USER, PORT, RETR, QUIT} and {USER, PASS, PORT, QUIT, RETR} are rejected.
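The mutation procedure can be sketched in Python as follows. The `is_accepted` callable stands for the acceptance check of the actual model; it, the function names, the seed, and the `max_tries` bound are assumptions made for this illustration.

```python
import random


def mutate_session(session, alphabet, rng):
    """Apply one random mutation: swap two messages, delete one, or add one."""
    mutated = list(session)
    op = rng.choice(["swap", "delete", "add"])
    if op == "swap" and len(mutated) >= 2:
        i, j = rng.sample(range(len(mutated)), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    elif op == "delete" and mutated:
        del mutated[rng.randrange(len(mutated))]
    else:
        mutated.insert(rng.randrange(len(mutated) + 1), rng.choice(alphabet))
    return mutated


def generate_negative_sessions(positive, alphabet, is_accepted, count,
                               seed=0, max_tries=10000):
    """Mutate positive sessions until count rejected sessions are collected."""
    rng = random.Random(seed)
    negatives = []
    for _ in range(max_tries):
        if len(negatives) >= count:
            break
        candidate = mutate_session(rng.choice(positive), alphabet, rng)
        if not is_accepted(candidate):  # keep only sessions the true model rejects
            negatives.append(candidate)
    return negatives
```

Filtering with `is_accepted` matters because a mutation does not always invalidate a session; for example, inserting an extra LIST into a logged-in FTP session may still be valid.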

Finally, to compute the correctness score, the test set consisting of both accepted and rejected sessions was submitted to the inferred models. The ratio of accepted to rejected sessions affects the correctness score. In this work, accepted-to-rejected session ratios of 4:1, 9:1 and 2:1 were used. Due to limited space, the results shown in this paper are based on the ratio of 4:1.

5. Cost analysis and comparison

Most protocol reverse engineering tools are developed toward specific design targets, such as the inference of the message format, the reconstruction of the behavior model, or both, for different use cases (Kleber et al., 2019). Table 1 lists the comparisons of selected works in protocol reverse engineering from network traffic traces. In particular, the core algorithms utilized for message type identification are listed, since this is the processing step commonly found in these works.

The proposed ReFSM architecture, as presented in Fig. 1, realizes four primary processing steps targeting the use case of network test case generation. In the beginning, the system conducts the packet pre-processing to obtain the protocol messages. The cost is proportional to the number of messages 𝑛.

The ReFSM system uses two techniques for message type identification: the Apriori algorithm and K-means. The time complexity of the K-means clustering is 𝑂(𝑛²). For the Apriori algorithm, the complexity is $O\left(\sum_{k=2}^{l} k(k-2)|C_k|\right)$, where $|C_k|$ denotes the number of items in the set containing sub-sequences of $k$-byte keywords (Tan et al., 2006) and $l$ is the maximum length of a message. Fortunately, with pruning based on a specific threshold, the message type identification can be processed efficiently. The CSP is derived from the Generalized Sequential Pattern (GSP) (Srikant and Agrawal, 1996), an optimized Apriori algorithm, so that the keywords found by Apriori cover all keywords found by CSP with less computational complexity. In general, the empirical evaluation using real-world data sets indicates that GSP runs faster than the Apriori algorithm (Agrawal and Srikant, 1994). ReverX (Antunes et al., 2011) adopted frequency labels, a keyword-based analysis methodology, to construct the Prefix Tree Acceptor (PTA) for message type identification.

NetJob (Bossert et al., 2014) used the K-means algorithm with the N&W alignment score as the distance metric. The cost of calculating the distance metric is 𝑂(𝑙²), where 𝑙 is the maximum length of a message. In ScriptGen (Leita et al., 2005), the authors used the same method as NetJob with a modified N&W alignment. Overall, the complexity is still the same. PRISMA (Krueger et al., 2012) adopted the k-medoids clustering algorithm with the Jaccard index as the distance metric. The cost of the Jaccard index is 𝑂(𝑙²).

The finite state machine construction process consists of two significant steps: Prefix Tree Acceptor (PTA) generation and state merging.

The system has to process every message to generate the PTA and therefore, the complexity is 𝑂(𝑛). In the step of state merging, the system searches all pairs of most likely equivalent states. It takes 𝑂(𝑡²), where 𝑡 is the number of states in the PTA. In the worst-case scenario, 𝑡 equals 𝑛 (the number of messages), and we can approximate the complexity to be 𝑂(𝑛²).

The progressive Needleman and Wunsch sequence alignment is conducted for clusters in the process of semantic deduction. A cluster consists of a set of messages having a similar structure. For the N&W algorithm, the complexity is 𝑂(𝑚³𝑙²), where 𝑚 is the average number of messages in a single cluster and 𝑙 is the max length (bytes) of a single message. Overall, for all clusters, we can do the sequence alignment in parallel to reduce the processing time to 𝑂(𝑛𝑚²𝑙²).

Finally, in order to find the message dependencies, the system has to compute the Pearson correlation coefficient for every possible pair