Real time P2P traffic identification

Chapter 2 Related Work

2.3 Real time P2P traffic identification

Timely P2P traffic identification is essential if network management is intended to quickly react to the presence of P2P traffic. To meet this requirement, [13] [14] [18]

present a ML based schemes, they use the first five packets of the flow to perform

identification, but they cannot deal with missing of some initial packets of a flow. In [20], it uses the count of the total number of a flow to do classification. The key points of each approach are summarized in Table 3, including the proposed RTI.

10 for filtering out non-P2P packets

P2P-using host identification phase

P2P file sharing traffic identification phase

Using three heuristics to identify P2P-using hosts

(H3, H4, H5)

Flows of P2P-using hosts

Building flow groups (H6)

Using three heuristics to identify P2P flow group

Figure 2. The flow chart of the proposed RTI scheme.

Chapter 3 Design Approach

In this chapter, we introduce our heuristic-based real time file sharing traffic identification (RTI) scheme that is divided into three phases: pre-processing phase, P2P-using host identification phase and P2P file sharing traffic identification phase,

as summarized in Figure 2. The purpose of the pre-processing phase is to speed up the identification process by using port number information. In the second phase, we identify which hosts use P2P applications. Then we can filter out the traffic from non-P2P using hosts and further identify P2P file sharing traffic from P2P-using hosts.

There are total nine heuristics in our proposed RTI scheme, as shown in Table 4.

Table 4. Heuristics summary.

Heuristic Purpose In which phase

H1: well-known ports Using well-known ports to identify non-P2P packets

Pre-processing phase H2: port association

If a subsequent port number is same as or

contiguous with the previous ones, then these

ports are associated

H3: the ratio between

#dPort and #dIP

P2P applications usually have the same number of distinct IP and ports; this heuristic can be used to

identify P2P-using hosts P2P-using host identification phase H4: the usage of UDP

packets

P2P applications must have UDP packets H5: the number of distinct

dIP

For file sharing purpose, P2P peers usually connect

to multiple peers

H6: building flow groups

Similar to H2. All flows in a flow group have associated port numbers

P2P file sharing traffic identification phase H7: the ratio between

#dPort and #dIP

Similar to H3, but used to identify P2P flow groups H8: the number of distinct

dIP

Similar to H5, but used to identify P2P flow groups

H9: the number of packet size switching (PSW) in

UDP packets

P2P file sharing applications use UDP for

signaling traffic, and its packet size is usually small; this heuristic is good for filtering Skype

traffic

3.1 Pre-processing phase

In this phase, we do some pre-processing to filter out some non-P2P packets, and use port numbers to speed up the identification process.

H1: well-known ports: Even though classification based on port numbers is unreliable, well-known port numbers are still suitable to identify traffic of some common applications. P2P applications usually choose port numbers from 1024-65535, so applications with all well-known ports (0-1023) excluding web ports (80, 443, etc.) are non-P2P. Note that HTTP ports are not only used for web surfing but also for some P2P applications.

H2: port association: In general, when a session of an application needs to build multi-connections, the kernel assigns a continuous ports range for each application, even for applications using randomized port numbers. We maintain the hosts and their listening port numbers in a table called Ports Association Table (PAT), in which each item is a two-tuple <IP, port>. When a packet arrivals, if the source or destination IP address can be found in the PAT and source or destination port numbers are contiguous or same as that in the PAT, then the packet belongs to a non-P2P application. To update the PAT, when a packet arrivals, if the source port is a well-known port, then add the destination IP and port into the PAT, and vice versa. In [10], we know that port association can speed up the identification process and make it more accurate. The port association algorithm in pseudo code is given in Figure 3.

The time complexity of this algorithm is O(#packets).

Building flows: After applying the above two heuristics, we filter out the packet whose nonP2PFlag is marked false. And we build flow by five-tuple: source IP address (sIP), source port (sPort), destination IP address (dIP), destination port (dPort), and transport layer protocol.

Port Association Algorithm Input: all packets in the trace

Output: packets with nonP2PFlag = false Alogorithm:

1: boolean nonP2PFlag; // true: non-P2P 2: for ( each packet )

3: if ( both sPort and dPort are well-known ports ) 4: return nonP2PFlag = true

5: else if ( sPort is well-known port )

6: Add dIP and dPort into PAT //PAT: Ports Association Table 7: return nonP2PFlag = true

8: else if ( dPort is well-known port ) 9: Add sIP and sPort into PAT 10: return nonP2PFlag = true 11: else

12: if ( sIP in PAT ) and ( sPort is contiguous or same as that in PAT)

13: return nonP2PFlag = true

14: else if ( dIP in PAT ) and ( dPort is contiguous or same as that in PAT)

15: return nonP2PFlag = true 16: else

17: return nonP2PFlag = false 18: end for

Figure 3. Port association algorithm in pseudo code.

P2P Host Identification Algorithm

Input: the packets not filtered in the pre-processing phase Output: all flows of the P2P-usimg hosts

Alogorithm:

1: for ( each host h )

2: Allflows = get all flows of host h 3: while ( Allflows.readLine != null )

4: #dIP = the number of distinct destination IP addresses 5: #dPort = the number of distinct destination port numbers 6: ratio = #dPort/#dIP

7: if ( there is a UDP packet in Allflows ) 8: UDPFlag = 1;

9: end while

10: if ( ( ratio >= 0.85 ) && ( UDPFlag = =1 ) && ( #dIP > 9 ) ) 11: h = P2P-Using host

12: end for

Figure 4. P2P-using host identification algorithm in pseudo code.

3.2 P2P-using host identification phase

We use three heuristics to find out P2P-using hosts in this phase. The proposed heuristics include some thresholds which were derived empirically through experiments based on a number of traces. The P2P-using host identification algorithm in pseudo code is given in Figure 4. The time complexity of this algorithm is O(#hosts．

#flows).

H3: the ratio between #dPort and #dIP: As indicated by [7][15] and Figure 1, P2P peers usually maintain only one connection to other peers, which means that each host has the same number of distinct destination IP addresses (#dIP) and number of distinct ports (#dPort) connected to it. In our RTI scheme, if the ratio between #dPort and #dIP from a certain host is less than 0.85, the host is considered as a non-P2P host.

H4: the usage of UDP packets: Most P2P file sharing applications use UDP to find a peer neighbor or share files with a peer. If there is no UDP packet, the host is considered as a non-P2P host.

H5: the number of distinct dIP: Non-P2P traffic typically uses multiple connections to one server. If the number of distinct destination IP addresses is less than or equal to 9, the host is considered as non-P2P host.

3.3 P2P file sharing traffic identification phase

This phase uses four heuristics to identify P2P file sharing traffic of the P2P-using hosts which were identified in the previous phase. Instead of inspecting every flow, the identification is associated with flow groups. We can identify one packet flow as long as the flow is associated with others. The P2P file sharing traffic identification algorithm in pseudo code is given in Figure 5. The time complexity of this algorithm is O(#flowgroup．#UDPpackets).

H6: building flow groups: This heuristic uses the property of port association.

Figure 6 shows that an example of port locality for a specific host that its packets are separated into three groups.

The following heuristics are only concerned of UDP packets.

P2P File Sharing Traffic Identification Algorithm Input: all flows of P2P-using hosts

Output: P2P file sharing traffic Alogorithm:

1: Use port association to build flow groups 2: for ( each flowgroup fg )

3: AllPackets = get all UDP packets of flow group fg 4: while ( Allpackets.readLine != null )

5: #dIP = the number of distinct destination IP addresses 6: #dPort = the number of distinct destination port numbers 7: ratio = #dPort/#dIP

8: if ( || packetSize – lastPacketSize || >= 365 ) 9: #PSW ++

10: end while

11: if ( ( ratio >= 0.85 ) && ( #PSW < 11 ) && ( #dIP > 3 ) ) 12: all traffic in fg = P2P file sharing traffic

13: end for

Figure 5. P2P file sharing traffic identification algorithm in pseudo code.

H7: the ratio between #dPort and #dIP: Similar to H3, if the ratio between

#dPort and #dIP from a specific flow group is less than 0.85, the flow group is considered as a non-P2P flow group.

H8: the number of distinct dIP: Similar to H5, if the number of distinct destination IP addresses is less than or equal 2, the flow group is considered as a non-P2P flow group.

Figure 6. An example of port locality for a specific host.

0 5000 10000 15000 20000 25000 30000 35000

0 50000 100000 150000 200000 250000 300000 350000

Port number

Packet sequence number Port locality

16001

H9: the number of packet size switching (PSW) in UDP packets: Packet size switching was originally proposed by [11] for identifying P2P flows, but we only use it for UDP packets. PSW is the number of packet size switching between a packet and its previous packet exceeding 365 bytes. If the number of PSW in UDP packets is greater than or equal to 11, the flow group is considered as a non-P2P flow group.

P2P file sharing applications use UDP for signaling traffic, the packet size is usually small. This heuristic is good for filtering out Skype traffic, because media traffic flowed directly between Skype clients over UDP [17].

32679

80 16001

32679

Chapter 4 Evaluation

4.1 Trace collection

Traffic traces used for experiments in this research were captured from the dormitories of the National Chiao Tung University on February 25, 2009 from 3:00:01 a.m. to 3:00:06 a.m. (t1) and 20:00:00 p.m. to 20:00:05 p.m. (t2). These traces are 5 seconds long, which was pre-classified by a payload-based classifier for verifying our P2P traffic identification results. There were 610 and 838 users, respectively. We identified P2P file sharing traffic which included BitTorrent, eDonkey, and Gnutella applications. And we also prepared a longer trace of 250 seconds on February 25, 2009 from 3:00:01 a.m. to 3:04:11 a.m. (t3) for John [9], another heuristic-based scheme, for comparison.

The information we were concerned on packets are source IPs and ports, destination IPs and ports, transport layer protocol, and packet length. These data can be found in the header easily, and we did not inspect any payload.

4.2 Performance metrics

Two performance metrics were used to evaluate the effectiveness of our scheme.

They are Accuracy = (TP + TN) / N and False Positive Rate (FPRate) = FP / (FP + TN) [12], where TP represents the number of correctly identified samples of P2P, TN represents the number of correctly identified samples (packets, flows, or flow groups)

Figure 7. The accuracy of P2P-using host identification phase.

The ratio between #dstPort and #dstIP (H3)

Accuracy of non-P2P, FP represents the number of falsely identified samples that belong to P2P, and N represents the total number of samples, which equals to TP+TN+FP+FN.

4.3 Results of P2P-using host identification

The accuracy and FPRate obtained by applying different combinations of H3 ~ H5 are presented in Figures 7 and 8. When the threshold of H3 was set to 0.85, we got the highest accuracy of 92.295% and the FPRate of 6.579% in the host level.

Figure 8. The FPRate of P2P-using host identification phase.

The ratio between #dstPort and #dstIP (H3)

FPRate

Figure 9. The accuracy of P2P traffic identification phase.

92%

The ratio between #dPort and #dIP (H7)

Accuracy

4.4 Results of P2P file sharing traffic identification

The results of P2P traffic identification are shown in Figures 9 and 10. When the threshold of H7 was set to 0.85, we got the highest accuracy of 98.288% and FPRate of 1.442% in the flow group level. For trace t2, the accuracy is 97.114% and FPRate is 2.286% in the flow group level.

Figure 10. The FPRate of P2P traffic identification phase.

The ratio between #dPort and #dIP (H7)

FPRate

4.5 Compared to existing approaches

The overall results are presented in Figure 11, which involves three real traces.

For our RTI using trace t1, the accuracy is 96.19% and the FPRate is 3.5% in the flow level. For our RTI using trace t2, the accuracy is 95.262% and the FPRate is 5.549%

in the flow level. For Perenyi [8] using trace t1, the accuracy is 70.95% and the FPRate is 81.77% in the flow level. For John [9] using trace t1, the accuracy is 64.76% and the FPRate is 74.19% in the flow level. Note that the accuracy of John [9]

is worse than that of Perenyi [8] is because that the duration of t1 is too short for John’s heuristics, particularly the heuristic of IP/port pairs, as shown in Table 2.

We also implemented John [9] using t3, the accuracy is 82.824% and the FPRate is 48.647% in the flow level. The results are better than those using t1. However, its FRPate is still high. This is due to their thresholds and heuristics are not suited for our traces. In Table 5, we show the performance evaluation results of different approaches

Accuracy and FPRate (%)

Figure 11. The accuracy and FPRate of our scheme in comparison with those of existing schemes.

Table 5. The performance evaluation results of different approaches using trace t1.

Approach

Chapter 5 Conclusion and Future Work

5.1 Concluding remarks

In this thesis, we have presented three phases with nine heuristics to identify P2P file sharing traffic in real time. Our RTI method operates at three levels: (1) the packet level: using well-known port numbers to filter non-P2P packets, (2) the host level:

finding out which host has used P2P file sharing applications, (3) the flow group level:

identifying P2P file sharing traffic from the P2P-using hosts. These heuristics derived from the behaviors of P2P applications and port numbers information. We have applied our RTI method to real traces without accessing any packet payload information. Experimental results have shown that the proposed RTI had high accuracy of 96.2% and low false positive rate (FPRate) of 3.5% by using only 5 seconds of a real trace. This means the proposed RTI can identify P2P file sharing traffic in real time to facilitate network management for dealing with the problems of internet piracy and unreasonable utilization of network resources.

5.2 Future work

In our RTI scheme, we only considered the issue of identifying P2P file sharing traffic. Our future work will focus on P2P applications classification to identify a specific P2P application (e.q., BitTorrent or eMule, etc.) for achieving more effective network management.

References

[1] Subhabrata Sen , Jia Wang, Analyzing peer-to-peer traffic across large networks, Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment,

November 06-08, 2002, Marseille, France.

[2] “BitTorrent,” [Online]. Available: http://www.bittorrent.com/.

[3] “eMule,” [Online]. Available:

http://www.cs.huji.ac.il/labs/danss/presentations/emule.pdf.

[4] “Internet Assigned Numbers Authority (IANA),” [Online]. Available:

http://www.iana.org/assignments/port-numbers.

[5] S. Subhabrata, S. Oliver and D. Wang, “Accurate, scalable in-network identification of P2P traffic using application signatures,” in Proceedings of Thirteenth International World Wide Web Conference, pp. 512-521, 2004.

[6] Thuy T. T. Nguyen, Grenville Armitage, "A survey of techniques for internet traffic classification using machine learning," in IEEE Communications Surveys &

Tutorials, vol. 10, no. 4, Mar 2008, pp. 56-76

[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, “Transport layer identification of p2p traffic,” in Proceedings of the 4th ACM Conference on Internet Measurement, 2004, pp. 121-134.

[8] M. Perenyi, D. Trang Dinh, A. Gefferth, and S. Molnar, “Identification and analysis of peer-to-peer traffic,” in Journal of Communications, vol. 1, no. 7, 2006, pp. 36–46.

[9] W. John and S. Tafvelin, “Heuristics to Classify Internet Backbone Traffic based on Connection Patterns,” In ICOIN ’08: Proceedings of the 22nd International Conference on Information Networking, January 2008, pp.1-5.

[10] Y.D. Lin, C.N. Lu, Y.C. Lai, W.H. Peng, P.C. Lin, “Application classification using packet size distribution and port association,” in Journal of Network and Computer Applications, 2009

[11] F. G. Chou, "P2P Flow Identification," Master Thesis, National Taiwan University of Science and Technology, Taiwan, 2006.

[12] http://en.wikipedia.org/wiki/Receiver_operating_characteristic

[13] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic classification on the fly,” in ACM Special Interest Group on Data Communication (SIGCOMM) Computer Communication Review, vol. 36, no. 2, 2006.

[14] L. Bernaille, R. Teixeira, and K. Salamatian, "Early Application Identification,"

in International Conference On Emerging Networking Experiments And Technologies, no. 6, December 2006.

[15] T Karagiannis, K Papagiannaki, and M Faloutsos, "BLINC: Multilevel traffic classification in the dark, " Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications, August

22-26, 2005, Philadelphia, Pennsylvania, USA

[16] “Gnutella hosts,” [Online]. Available: http://www.gnutellahosts.com.

[17] S. Baset and H. Schulzrinne, "An analysis of the skype peer-to-peer Internet telephony protocol," in Columbia University Technical Report CUCS-039-04, September 2004, pp.1-11.

[18] Jun Li, Shunyi Zhang, Yanqing Lu, and Junrong Yan, “Real-time P2P traffic identification,” in Global Telecommunications Conference, Nov 2008, pp.1-5.

[19] A. Madhukar and C. Williamson, “A Longitudinal Study of P2P Traffic Identifiaction,” in MASCOT’06, August 2006, pp. 179-188.

[20] Jeffrey Erman, Anirban Mahanti, Martin Arlitt, Ira Cohen, and Carey

learning,” in 26th International Symposium on Computer Performance, Modeling, Measurements, and Evaluation, Volume 64, Issues 9-12, October 2007, pp.

1194-1213.

在文檔中基於連線模式之即時P2P檔案分享的流量辨識方法 (頁 19-0)