6.5.5 Effect of Limited Buffer Sizes

As pointed out earlier in this section, a limited buffer size has little impact on the accuracy of the model. This is verified in Fig. 6.12, which compares two very different buffer sizes, 3 and 1000 packets. The figure shows that the core utilization is the same for both sizes as long as the input load does not exceed the system capacity. The queue length (not shown) does not grow noticeably in either case, implying that the system is quite tolerant of variance in the packet inter-arrival time.
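The insensitivity to buffer size at moderate load can be illustrated with a textbook finite-buffer queue. The sketch below uses the closed-form M/M/1/K blocking probability. It is not the Petri net model of this chapter: exponential service times, the function names, and the mapping of "buffer size" to the capacity k (which here counts the packet in service) are all assumptions made for illustration. Exponential service is also more variable than a routine packet-processing task, so this sketch likely overstates the loss for small buffers.

```python
def blocking_probability(rho: float, k: int) -> float:
    """Probability that an arriving packet finds an M/M/1/K system full.

    rho is the offered load (arrival rate / service rate); k is the total
    capacity, counting the packet in service. Both are illustrative inputs.
    """
    if rho <= 0.0:
        return 0.0
    if rho == 1.0:
        return 1.0 / (k + 1)
    if rho < 1.0:
        return (1.0 - rho) * rho ** k / (1.0 - rho ** (k + 1))
    # For rho > 1 use the algebraically equivalent form in r = 1/rho,
    # which avoids overflow when k is large.
    r = 1.0 / rho
    return (1.0 - r) / (1.0 - r ** (k + 1))


def utilization(rho: float, k: int) -> float:
    """Fraction of time the server is busy: the load that is actually carried."""
    return rho * (1.0 - blocking_probability(rho, k))


if __name__ == "__main__":
    for rho in (0.2, 0.5, 0.8, 1.2):
        small = utilization(rho, 3)
        large = utilization(rho, 1000)
        print(f"rho={rho:.1f}  k=3: {small:.4f}  k=1000: {large:.4f}")
```

Under these assumptions the two capacities give nearly the same utilization at light load and drift apart only as the offered load approaches the service capacity.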

6.6 Summary

This work aims to derive design implications for core-centric network processors by developing an analytical model together with simulations based on timed, colored Petri nets. A computationally intensive VPN application, which contains complex but routine tasks, is adopted to explore the benefit of offloading to coprocessors. To date, this work is the first to practically model the interrupt-driven and busy-waiting schemes on this emerging architecture.

The analytical model agrees closely with the simulation (within 1%) and with the implementation (within 3%-4%), indicating accuracy sufficient for detailed investigation of architectural-level issues that are difficult to observe on real implementations. Through both analytical and simulation results we observe that

1. By adopting appropriate process run lengths, an improvement of 2.26 times in the effective core utilization and a 20.5% reduction in the consumption of computational resources can be achieved; better results can be had if run lengths are further differentiated according to the processing time.

2. By reducing the context switch delay from 300 μsec to 10 μsec, we obtain an improvement of 2.6 times in the effective core utilization, and the switching overhead and busy-waiting time can be reduced by as much as 90%; this observation also strongly suggests the use of a single process for multiple tasks, since a 10 μsec delay is normally infeasible with today’s technology.

3. By incorporating coprocessors for the bottleneck task, namely encryption/decryption, the throughput is boosted 7.5 times compared with that of a single processor.

4. Under Poisson arrivals, the system is quite tolerant of a limited buffer size.

Fig. 6.12. Core utilization under two buffer sizes (core utilization in % versus input load in Mbps, for buffers of 1000 packets and 3 packets).

We believe the first two findings are useful to system vendors, while the others may interest IC vendors. The findings of this study should be applicable to network processors of similar architecture.
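A rough way to read the offloading result is through Amdahl's law, which bounds the gain from accelerating only part of a workload. The sketch below is not the analytical model of this dissertation: it assumes strictly serial execution, and the function and parameter names are hypothetical. It only shows what fraction of the per-packet work would have to move off the core for a 7.5-fold gain to be reachable at all.

```python
def amdahl_speedup(offload_fraction: float, coprocessor_speedup: float) -> float:
    """Overall speedup when a fraction of the work runs coprocessor_speedup times faster.

    Classic Amdahl's law; assumes the offloaded and remaining parts run serially.
    """
    remaining = 1.0 - offload_fraction
    return 1.0 / (remaining + offload_fraction / coprocessor_speedup)


def min_offload_fraction(target_speedup: float) -> float:
    """Smallest offloaded fraction that permits target_speedup,
    even with an infinitely fast coprocessor (speedup limit 1 / (1 - f))."""
    return 1.0 - 1.0 / target_speedup


if __name__ == "__main__":
    # 7.5 is the throughput gain reported above; everything else is illustrative.
    f = min_offload_fraction(7.5)
    print(f"at least {f:.1%} of the work must be offloaded")
    print(f"speedup with f=0.9 and a 10x coprocessor: {amdahl_speedup(0.9, 10.0):.2f}")
```

Under this simplified view, the work that stays on the core sets the ceiling on the achievable gain, however fast the coprocessors are.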

As future work, we plan to extend this approach to memory-access-intensive applications such as IDP (Intrusion Detection and Prevention). In such an extension, memory access operations can be offloaded to coprocessors specifically designed with a wide memory bus. To further analyze the potential memory bottleneck, the model can also include multiple memory modules or multi-port memory supporting concurrent accesses.

Chapter 7 Conclusions

The goals of this dissertation are (1) a comparison of the thread allocation schemes in multithreading architectures, and the derivation of (2) design implications and (3) resource allocation strategies for coprocessors-centric and core-centric network processors implementing different types of applications. For the first goal, we found that heterogeneous thread allocation is the best scheme, since load balancing among processors is simple and effective compared with the homogeneous and hybrid schemes. It is also resilient to unbalanced load among threads for imbalance ratios smaller than 1.5. Observations regarding the other goals are categorized and stated as follows.

General NP Design Implications

1. Number of threads per processor: For a sensible P-M ratio, i.e., a ratio close to 1 as in the SF/DS over the IXP1200, the most appropriate number of threads is 5; this number should be increased as the ratio decreases and decreased as the ratio increases.

2. Solution to memory bottleneck: To resolve a memory bottleneck, if any, adding memory banks improves the performance the most, though the effectiveness depends heavily on the data structure of the application/algorithm.

Resource Allocation for Coprocessors-centric NPs Implementing Memory Access Intensive Applications

1. Most important architectural factor: Given a certain application and algorithm, the throughput is influenced mostly by the total number of threads as long as the processor utilizations do not exceed 100%.

2. Although enlarging the total number of threads by adding more processors benefits the throughput, the ME utilization suffers. This is because the load saturating memory is diluted by the increased I, meaning that J, rather than I, should be extended.

3. Most appropriate (I, J) estimation through bottleneck identification: The bottleneck is found to be the SRAM when I×J exceeds the upper bound k that cost-effectively utilizes the memory. With this upper bound, we can always estimate the most appropriate (I, J) configuration for the application.
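The three observations above can be sketched as a small selection procedure. The sketch reads I as the number of processors and J as the number of threads per processor, which is inferred from the text rather than restated here. The hardware limits `i_max` and `j_max`, the function name, and the tie-breaking rule (favor fewer processors, so that J is extended before I) are illustrative assumptions, not the estimation method of this dissertation. It also ignores the condition that processor utilization must stay below 100%.

```python
from typing import Optional, Tuple


def choose_configuration(k: int, i_max: int, j_max: int) -> Optional[Tuple[int, int]]:
    """Pick (I, J) with I * J <= k, using as many threads as the bound allows.

    Among configurations with the same total thread count, the one with the
    fewest processors wins, reflecting the observation that J should be
    extended rather than I.
    """
    best_key = None
    best_cfg = None
    for i in range(1, i_max + 1):
        j = min(j_max, k // i)  # largest J that keeps I * J within the bound
        if j < 1:
            break  # more processors than the bound can accommodate
        key = (i * j, -i)  # more threads first, then fewer processors
        if best_key is None or key > best_key:
            best_key, best_cfg = key, (i, j)
    return best_cfg


if __name__ == "__main__":
    # Illustrative values only: 6 processors with up to 4 threads each.
    print(choose_configuration(k=16, i_max=6, j_max=4))
    print(choose_configuration(k=12, i_max=6, j_max=4))
```

With a bound of 12 threads, three configurations reach the bound (3×4, 4×3, 6×2); the tie-break selects the one with the fewest processors.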

Resource Allocation for Core-centric NPs Implementing Computational Intensive Applications

1. Improvement from offloading: Offloading from the core processor to the coprocessors improves the overall performance by 7.5 times. Moreover, offloading the crypto processing benefits the throughput more than offloading the Ethernet processing.

2. Bottleneck observation: The core tends to be the bottleneck even after offloading.

3. Effect and implications from run length analysis: By adopting appropriate process run lengths, an improvement of 2.26 times in the effective core utilization and a 20.5% reduction in the consumption of computational resources can be achieved; better results can be had if run lengths are further differentiated according to the processing time.

4. Effect and implications from context switch overhead analysis: By reducing the context switch delay from 300 μsec to 10 μsec, we obtain an improvement of 2.6 times in the effective core utilization, and the switching overhead and busy-waiting time can be reduced by as much as 90%; this observation also strongly suggests the use of a single process for multiple tasks, since a 10 μsec delay is normally infeasible with today’s technology.
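The interplay between run length and context switch delay in the last two items can be illustrated with a deliberately simple timing model. In the sketch below a process alternates between a run of useful work and one context switch. This is a toy model under stated assumptions, not the analytical model of Chapter 6: it ignores busy-waiting and coprocessor latency, the 500 μsec run length is hypothetical, and only the 300 μsec and 10 μsec delays come from the text.

```python
from typing import Tuple


def time_shares(run_length_us: float, switch_delay_us: float) -> Tuple[float, float]:
    """Return (useful share, switching share) of core time when every run of
    run_length_us microseconds is followed by one context switch."""
    period = run_length_us + switch_delay_us
    return run_length_us / period, switch_delay_us / period


if __name__ == "__main__":
    run_length = 500.0  # hypothetical run length in microseconds
    for delay in (300.0, 10.0):  # context switch delays discussed above
        useful, switching = time_shares(run_length, delay)
        print(f"delay={delay:5.0f} usec  useful={useful:.3f}  switching={switching:.3f}")
```

In this toy model, longer runs amortize a fixed switch delay over more useful work, and a shorter delay shrinks the switching share directly, which is the qualitative trend behind both items.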
