P2P file sharing traffic identification phase

Chapter 3 Design Approach

3.3 P2P file sharing traffic identification phase

This phase uses four heuristics (H6 to H9) to identify the P2P file sharing traffic of the P2P-using hosts found in the previous phase. Instead of inspecting every flow individually, identification is performed on flow groups, so a packet flow can be identified as long as it is associated with other flows. The P2P file sharing traffic identification algorithm is given in pseudo code in Figure 5. The time complexity of this algorithm is O(#flowgroup · #UDPpackets).

H6: building flow groups: This heuristic uses the property of port association.

Figure 6 shows an example of port locality for a specific host, whose packets are separated into three groups.

The following heuristics consider only UDP packets.

P2P File Sharing Traffic Identification Algorithm

Input: all flows of P2P-using hosts
Output: P2P file sharing traffic

Algorithm:
    Use port association to build flow groups
    for ( each flow group fg )
        AllPackets = get all UDP packets of flow group fg
        while ( AllPackets.readLine != null )
            #dIP = the number of distinct destination IP addresses
            #dPort = the number of distinct destination port numbers
            ratio = #dPort / #dIP
            if ( | packetSize - lastPacketSize | >= 365 )
                #PSW++
        end while
        if ( ( ratio >= 0.85 ) && ( #PSW < 11 ) && ( #dIP > 3 ) )
            all traffic in fg = P2P file sharing traffic
    end for

Figure 5. P2P file sharing traffic identification algorithm in pseudo code.
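The per-flow-group test in Figure 5 can be sketched in Python as follows. This is a minimal illustration and not the thesis implementation: the `UdpPacket` record, its field names, and the function name are assumptions, flow groups are taken as already built by H6, and the ratio is computed once after the loop rather than on every packet. The thresholds are the ones printed in Figure 5.

```python
from dataclasses import dataclass
from typing import Iterable

# Thresholds as printed in Figure 5.
RATIO_MIN = 0.85   # H7: minimum #dPort / #dIP
PSW_BYTES = 365    # H9: size difference that counts as one switch
PSW_LIMIT = 11     # H9: a P2P flow group has fewer than 11 switches
MIN_DIP = 3        # H8 as written in Figure 5: #dIP must exceed 3


@dataclass(frozen=True)
class UdpPacket:
    """Hypothetical header-only record; the field names are assumptions."""
    dst_ip: str
    dst_port: int
    size: int


def is_p2p_flow_group(packets: Iterable[UdpPacket]) -> bool:
    """Apply the ratio, PSW, and #dIP tests to the UDP packets of one flow group."""
    dst_ips, dst_ports = set(), set()
    psw = 0
    last_size = None
    for pkt in packets:
        dst_ips.add(pkt.dst_ip)
        dst_ports.add(pkt.dst_port)
        # Assumption: the first packet has no predecessor, so it cannot switch.
        if last_size is not None and abs(pkt.size - last_size) >= PSW_BYTES:
            psw += 1
        last_size = pkt.size
    if not dst_ips:
        return False  # no UDP packets in this group, nothing to test
    ratio = len(dst_ports) / len(dst_ips)
    return ratio >= RATIO_MIN and psw < PSW_LIMIT and len(dst_ips) > MIN_DIP


# Made-up flow group: five small packets to five distinct peers and ports.
group = [UdpPacket(f"10.0.0.{i}", 6000 + i, 100 + i) for i in range(5)]
print(is_p2p_flow_group(group))
```

A flow group that sends every packet to the same destination port, such as DNS queries to several resolvers, fails the ratio test in this sketch even when it contacts many hosts.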

H7: the ratio between #dPort and #dIP: Similar to H3, if the ratio between #dPort and #dIP of a specific flow group is less than 0.85, the flow group is considered a non-P2P flow group.

H8: the number of distinct dIP: Similar to H5, if the number of distinct destination IP addresses is less than or equal to 2, the flow group is considered a non-P2P flow group.

Figure 6. An example of port locality for a specific host. (Plot title: Port locality; axes: port number, 0 to 35000, and packet sequence number, 0 to 350000; labeled ports: 80, 16001, 32679)

H9: the number of packet size switches (PSW) in UDP packets: Packet size switching was originally proposed in [11] for identifying P2P flows, but we apply it only to UDP packets. PSW is the number of times the size of a packet differs from that of its previous packet by more than 365 bytes. If the PSW count of the UDP packets is greater than or equal to 11, the flow group is considered a non-P2P flow group.

P2P file sharing applications use UDP for signaling traffic, so the packet size is usually small. This heuristic is good at filtering out Skype traffic, because media traffic flows directly between Skype clients over UDP [17].
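As a worked illustration of the PSW count, the sketch below compares two made-up packet size sequences. The function name and the sample sizes are assumptions, and the comparison follows the `>= 365` test printed in Figure 5.

```python
from typing import Sequence


def count_psw(sizes: Sequence[int], threshold: int = 365) -> int:
    """Count consecutive packet pairs whose sizes differ by at least `threshold` bytes."""
    return sum(1 for prev, cur in zip(sizes, sizes[1:]) if abs(cur - prev) >= threshold)


# Hypothetical signaling-like traffic: small packets of similar size.
signaling = [68, 75, 102, 68, 90, 110, 68, 75]
# Hypothetical mixed traffic: sizes alternate between small and large.
mixed = [100, 1200] * 6

print(count_psw(signaling))  # 0: no pair differs by 365 bytes or more
print(count_psw(mixed))      # 11: every one of the 11 consecutive pairs switches
```

With the H9 threshold of 11, the second sequence would mark its flow group as non-P2P, while the first would not.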


Chapter 4 Evaluation

4.1 Trace collection

The traffic traces used for the experiments in this research were captured from the dormitories of National Chiao Tung University on February 25, 2009, from 3:00:01 a.m. to 3:00:06 a.m. (t1) and from 8:00:00 p.m. to 8:00:05 p.m. (t2). Each trace is 5 seconds long and was pre-classified by a payload-based classifier to verify our P2P traffic identification results. There were 610 and 838 users in t1 and t2, respectively. The P2P file sharing traffic we identified included BitTorrent, eDonkey, and Gnutella applications. We also prepared a longer trace of 250 seconds, captured on February 25, 2009, from 3:00:01 a.m. to 3:04:11 a.m. (t3), for comparison with John [9], another heuristic-based scheme.

The packet information we used consists of source IP addresses and ports, destination IP addresses and ports, the transport layer protocol, and the packet length. These data are readily available in the packet headers, and we did not inspect any payload.
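The header fields listed above can be held in a small record type. The sketch below is illustrative only: the `PacketHeader` type, its field names, and the grouping of packets into flows by the conventional 5-tuple are assumptions, not details taken from the trace format used in the experiments.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class PacketHeader(NamedTuple):
    """Header-only view of one packet; no payload is stored."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "TCP" or "UDP"
    length: int    # packet length in bytes


FlowKey = Tuple[str, int, str, int, str]


def group_into_flows(headers: Iterable[PacketHeader]) -> Dict[FlowKey, List[PacketHeader]]:
    """Group packets by (src IP, src port, dst IP, dst port, protocol)."""
    flows: Dict[FlowKey, List[PacketHeader]] = defaultdict(list)
    for h in headers:
        flows[h[:5]].append(h)  # the first five fields form the flow key
    return dict(flows)


# Made-up packets: two belong to the same flow, one to a different flow.
packets = [
    PacketHeader("10.0.0.1", 16001, "10.0.0.9", 6881, "UDP", 90),
    PacketHeader("10.0.0.1", 16001, "10.0.0.9", 6881, "UDP", 104),
    PacketHeader("10.0.0.1", 32679, "10.0.0.7", 80, "TCP", 1500),
]
flows = group_into_flows(packets)
print(len(flows))
```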

4.2 Performance metrics

Two performance metrics were used to evaluate the effectiveness of our scheme. They are

Accuracy = (TP + TN) / N
False Positive Rate (FPRate) = FP / (FP + TN) [12],

where TP is the number of correctly identified P2P samples (packets, flows, or flow groups), TN is the number of correctly identified non-P2P samples, FP is the number of non-P2P samples falsely identified as P2P, and N is the total number of samples, which equals TP + TN + FP + FN. Here FN is the number of P2P samples falsely identified as non-P2P.

Figure 7. The accuracy of P2P-using host identification phase. (Axes: the ratio between #dstPort and #dstIP (H3); accuracy)
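The two metrics of Section 4.2 can be computed directly from the four counts. The sketch below uses made-up counts for illustration, not values from the traces, and the function names are assumptions.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / N, with N = TP + TN + FP + FN."""
    n = tp + tn + fp + fn
    if n == 0:
        raise ValueError("no samples")
    return (tp + tn) / n


def fp_rate(fp: int, tn: int) -> float:
    """FPRate = FP / (FP + TN), the share of non-P2P samples flagged as P2P."""
    if fp + tn == 0:
        raise ValueError("no non-P2P samples")
    return fp / (fp + tn)


# Hypothetical counts: 100 samples, 45 of them P2P.
tp, tn, fp, fn = 40, 50, 5, 5
print(f"Accuracy = {accuracy(tp, tn, fp, fn):.1%}")  # Accuracy = 90.0%
print(f"FPRate   = {fp_rate(fp, tn):.1%}")           # FPRate   = 9.1%
```

Note that FPRate is normalized by the non-P2P samples only, so a scheme can show high accuracy and still have a high FPRate when non-P2P samples are rare.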

4.3 Results of P2P-using host identification

The accuracy and FPRate obtained by applying different combinations of H3 to H5 are presented in Figures 7 and 8. When the threshold of H3 was set to 0.85, we obtained the highest accuracy of 92.295% with an FPRate of 6.579% at the host level.
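Selecting a threshold by its accuracy can be sketched as a simple sweep. The example below is a simplified illustration under stated assumptions: it varies a single ratio threshold in isolation (the experiments combine H3 to H5), it labels a host as P2P when its ratio is at least the threshold, and the ratios, labels, and candidate thresholds are made up.

```python
from typing import Sequence, Tuple


def sweep_threshold(
    ratios: Sequence[float],
    is_p2p: Sequence[bool],
    candidates: Sequence[float],
) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the candidate with the highest accuracy.

    Ties go to the earliest candidate in the list.
    """
    if not ratios or len(ratios) != len(is_p2p) or not candidates:
        raise ValueError("need matching, non-empty inputs")
    best = None
    for t in candidates:
        # A prediction is correct when (ratio >= t) matches the ground truth label.
        correct = sum((r >= t) == label for r, label in zip(ratios, is_p2p))
        acc = correct / len(ratios)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best


# Hypothetical per-host ratios with pre-classified labels.
ratios = [0.95, 0.90, 0.88, 0.60, 0.80, 0.30]
labels = [True, True, True, False, False, False]
print(sweep_threshold(ratios, labels, [0.5, 0.7, 0.85, 0.95]))  # (0.85, 1.0)
```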

Figure 8. The FPRate of P2P-using host identification phase. (Axes: the ratio between #dstPort and #dstIP (H3); FPRate)

Figure 9. The accuracy of P2P traffic identification phase. (Axes: the ratio between #dPort and #dIP (H7); accuracy)

4.4 Results of P2P file sharing traffic identification

The results of P2P traffic identification are shown in Figures 9 and 10. When the threshold of H7 was set to 0.85, we obtained the highest accuracy of 98.288% with an FPRate of 1.442% at the flow group level. For trace t2, the accuracy is 97.114% and the FPRate is 2.286% at the flow group level.

Figure 10. The FPRate of P2P traffic identification phase. (Axes: the ratio between #dPort and #dIP (H7); FPRate)

4.5 Compared to existing approaches

The overall results are presented in Figure 11, which involves three real traces. For our RTI using trace t1, the accuracy is 96.19% and the FPRate is 3.5% at the flow level. For our RTI using trace t2, the accuracy is 95.262% and the FPRate is 5.549% at the flow level. For Perenyi [8] using trace t1, the accuracy is 70.95% and the FPRate is 81.77% at the flow level. For John [9] using trace t1, the accuracy is 64.76% and the FPRate is 74.19% at the flow level. Note that the accuracy of John [9] is worse than that of Perenyi [8] because the duration of t1 is too short for John's heuristics, particularly the heuristic of IP/port pairs, as shown in Table 2.

We also implemented John [9] using t3; the accuracy is 82.824% and the FPRate is 48.647% at the flow level. These results are better than those using t1, but the FPRate is still high because the thresholds and heuristics of John [9] are not suited to our traces. Table 5 shows the performance evaluation results of the different approaches.

Figure 11. The accuracy and FPRate of our scheme in comparison with those of existing schemes. (Axis: accuracy and FPRate (%))

Table 5. The performance evaluation results of different approaches using trace t1. (First column: Approach)

Chapter 5 Conclusion and Future Work

5.1 Concluding remarks

In this thesis, we have presented three phases with nine heuristics to identify P2P file sharing traffic in real time. Our RTI method operates at three levels: (1) the packet level: using well-known port numbers to filter out non-P2P packets, (2) the host level: finding out which hosts have used P2P file sharing applications, and (3) the flow group level: identifying P2P file sharing traffic from the P2P-using hosts. These heuristics are derived from the behaviors of P2P applications and from port number information. We have applied our RTI method to real traces without accessing any packet payload information. Experimental results have shown that the proposed RTI achieves a high accuracy of 96.2% and a low false positive rate (FPRate) of 3.5% using only 5 seconds of a real trace. This means the proposed RTI can identify P2P file sharing traffic in real time, helping network management deal with the problems of Internet piracy and unreasonable utilization of network resources.

5.2 Future work

In our RTI scheme, we only considered the issue of identifying P2P file sharing traffic. Our future work will focus on P2P application classification, that is, identifying the specific P2P application (e.g., BitTorrent or eMule) to achieve more effective network management.

References

[1] Subhabrata Sen and Jia Wang, “Analyzing peer-to-peer traffic across large networks,” in Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, November 6-8, 2002, Marseille, France.

[2] “BitTorrent,” [Online]. Available: http://www.bittorrent.com/.

[3] “eMule,” [Online]. Available: http://www.cs.huji.ac.il/labs/danss/presentations/emule.pdf.

[4] “Internet Assigned Numbers Authority (IANA),” [Online]. Available: http://www.iana.org/assignments/port-numbers.

[5] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P traffic using application signatures,” in Proceedings of the Thirteenth International World Wide Web Conference, pp. 512-521, 2004.

[6] Thuy T. T. Nguyen and Grenville Armitage, “A survey of techniques for internet traffic classification using machine learning,” in IEEE Communications Surveys & Tutorials, vol. 10, no. 4, Mar 2008, pp. 56-76.

[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, “Transport layer identification of p2p traffic,” in Proceedings of the 4th ACM Conference on Internet Measurement, 2004, pp. 121-134.

[8] M. Perenyi, D. Trang Dinh, A. Gefferth, and S. Molnar, “Identification and analysis of peer-to-peer traffic,” in Journal of Communications, vol. 1, no. 7, 2006, pp. 36–46.

[9] W. John and S. Tafvelin, “Heuristics to Classify Internet Backbone Traffic based on Connection Patterns,” In ICOIN ’08: Proceedings of the 22nd International Conference on Information Networking, January 2008, pp.1-5.

[10] Y.D. Lin, C.N. Lu, Y.C. Lai, W.H. Peng, P.C. Lin, “Application classification using packet size distribution and port association,” in Journal of Network and Computer Applications, 2009.

[11] F. G. Chou, “P2P Flow Identification,” Master Thesis, National Taiwan University of Science and Technology, Taiwan, 2006.

[12] “Receiver operating characteristic,” [Online]. Available: http://en.wikipedia.org/wiki/Receiver_operating_characteristic.

[13] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic classification on the fly,” in ACM Special Interest Group on Data Communication (SIGCOMM) Computer Communication Review, vol. 36, no. 2, 2006.

[14] L. Bernaille, R. Teixeira, and K. Salamatian, “Early Application Identification,” in International Conference On Emerging Networking Experiments And Technologies, no. 6, December 2006.

[15] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel traffic classification in the dark,” in Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, August 22-26, 2005, Philadelphia, Pennsylvania, USA.

[16] “Gnutella hosts,” [Online]. Available: http://www.gnutellahosts.com.

[17] S. Baset and H. Schulzrinne, "An analysis of the skype peer-to-peer Internet telephony protocol," in Columbia University Technical Report CUCS-039-04, September 2004, pp.1-11.

[18] Jun Li, Shunyi Zhang, Yanqing Lu, and Junrong Yan, “Real-time P2P traffic identification,” in Global Telecommunications Conference, Nov 2008, pp.1-5.

[19] A. Madhukar and C. Williamson, “A Longitudinal Study of P2P Traffic Identification,” in MASCOT’06, August 2006, pp. 179-188.

[20] Jeffrey Erman, Anirban Mahanti, Martin Arlitt, Ira Cohen, and Carey Williamson, “Offline/realtime traffic classification using semi-supervised learning,” in 26th International Symposium on Computer Performance, Modeling, Measurements, and Evaluation, Volume 64, Issues 9-12, October 2007, pp. 1194-1213.
