Conclusions and Future Works - A Study of a Translation Lookaside Buffer with a Low Context-Switch Miss Rate and Its Asynchronous Circuit Implementation

In this thesis, we proposed a novel TLB architecture for asynchronous embedded processors. We also modeled it in Balsa, a CSP-based asynchronous HDL, and demonstrated how to map the proposed architecture onto asynchronous circuits. This chapter summarizes our conclusions and future work.

5-1 Conclusions

Computing devices have changed enormously over the past decades. In recent years, embedded systems and mobile devices have become the major trend. Early applications of these systems were simple, so no complex operating system was needed. However, newer embedded systems and mobile devices have begun to support very complex operating systems, such as Windows® Mobile and embedded Linux. Google even provides a powerful software stack platform based on embedded Linux, called Android [107]. All of these applications need efficient support for embedded operating systems. Traditional designs use a dedicated microcontroller or processor to run the OS and a separate DSP or accelerator to boost computing performance. Recently, some designs have offered an alternative by integrating a general-purpose processor and a DSP or accelerator into a single processor, such as the Blackfin cores [108] and TILE processors [109]. These trends demonstrate the importance of the OS in embedded systems and handheld devices. A modern OS needs high-performance translation from virtual to physical addresses, which requires an efficient TLB design. TLB misses cause serious performance degradation on modern processors, and context switching under a multiprogramming OS can make the problem even worse. However, only a few studies focus on the context-switching issue for embedded processors. In our work, we presented an alternative TLB design that reduces the miss rate under context switching for embedded processors.

In addition, it is widely known that synchronous circuits have some disadvantages, such as clock skew, higher power consumption, worst-case performance, and poor reusability. Asynchronous circuits can address these problems, and they also offer higher reliability and robustness than their synchronous counterparts. All of these are critical issues for embedded processors and microcontrollers. However, it is very hard to implement digital systems with asynchronous circuits.

In our work, we implemented the TLB controller for the proposed TLB architecture with asynchronous circuits, using the 4-phase bundled-data handshaking protocol. The bundled-data model was written in Balsa, a CSP-based asynchronous HDL. With Balsa, we can focus on the asynchronous architecture and algorithm design without worrying too much about handshaking protocol details. In addition, because the Balsa tools support several target handshaking protocols, there is no need to write a separate HDL model for each protocol, which provides greater flexibility. Unfortunately, the synthesized result shows that the total equivalent gate count of the TLB controller without memory is 688,560, which is not cheap. However, we also found that the CU and prefetch control parts are inexpensive: they cost only 1,441 equivalent gates, whereas the TLB memory part costs 687,119 equivalent gates. This is not only because we modeled many functions in this part but also because the Balsa tool suite adds a lot of extra memory control circuitry. Nevertheless, we successfully demonstrated an asynchronous TLB controller that is more advanced than those in related work. The main features of the proposed asynchronous TLB are as follows:

Modeled in Balsa, the TLB controller can be synthesized into any handshaking protocol supported by the Balsa framework.

Simple and clear interface definitions make the design easy to use.

Unambiguous separation of each part of the asynchronous design makes verification of the asynchronous TLB controller easier.
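The gate-count breakdown reported earlier in this section can be checked with a few lines of arithmetic. The figures are the ones quoted above; the variable names are ours and serve only as an illustration.

```python
# Equivalent gate counts quoted in section 5-1.
control_gates = 1_441     # CU and prefetch control parts
memory_gates = 687_119    # TLB memory part
total_gates = 688_560     # reported total for the TLB controller

# The two parts account for the whole reported total.
assert control_gates + memory_gates == total_gates

control_share = control_gates / total_gates   # roughly 0.2% of the total
memory_share = memory_gates / total_gates     # roughly 99.8% of the total
```

The breakdown shows that almost all of the cost lies in the TLB memory part.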

5-2 Future Works

In this thesis, we proposed an alternative TLB architecture that reduces the miss rate under context switching for asynchronous embedded processors. As mentioned in section 3-2, estimating the miss rate more accurately requires integrating the simulator with an OS, so a new simulator model should be developed for further study. In addition, as mentioned in section 3-1, TLB performance depends not only on the miss rate but also on the miss penalty, which means that execution time should be taken into consideration. However, for lack of information about the processor architecture and memory system, it is not easy to estimate execution time directly. The design should be placed into a real processor.
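The standard average-access-time relation shows how the miss rate and the miss penalty combine. The sketch below is only an illustration of that textbook relation: the function name, the choice of nanoseconds as the unit, and any numbers passed to it are hypothetical and are not measurements from this thesis.

```python
def average_translation_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average time of one address translation.

    Textbook relation: average = hit time + miss rate * miss penalty.
    All arguments are illustrative assumptions, not thesis results.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns
```

Under this relation, a design that lowers the miss rate only pays off in proportion to the miss penalty, which is why the penalty has to be measured in a real processor and memory system.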

As mentioned before, most asynchronous processors today are very simple, and most of them do not support virtual memory. In our work, we hope to provide a general asynchronous TLB architecture that can be implemented in asynchronous processors, which is why we modeled our design in Balsa. With a high-level asynchronous HDL, the design can be synthesized into any handshaking protocol supported by the Balsa tool suite. However, the Balsa tool suite cannot provide the real TLB memory, so it must be implemented separately. In this work, we simply used latches in place of the real TLB memory for verification. This is not realistic, so this part should be handled carefully in our future work. In addition, as the analysis in section 4-4 shows, besides the functions we modeled to control the behavior of the TLB memory, the extra circuitry added by the Balsa tool suite is very large. This part should be redesigned manually in the future.

Finally, our goal is to design our own asynchronous RISC core with virtual memory support for embedded systems and handheld devices. In fact, we hope to design an asynchronous SoC or MPSoC around our own asynchronous processor core. As mentioned in section 2-2-3, there are some studies of asynchronous interconnects and GALS. The clock has become one of the most critical issues in large SoC designs. As mentioned in section 1-2, asynchronous circuits may ideally make software “OOP”-style design possible in hardware. Imagine that, without the global clock issue, designing an SoC might be a little like playing with LEGO® bricks: ideally, any asynchronous IP can be plugged into the design as long as it “talks” the same “handshake protocol.” That is why we hope to design our own asynchronous processor core. The asynchronous TLB is one of the critical parts of that core. In order to verify our future processor core more formally, we suggest a new asynchronous processor design flow that supports architecture exploration and also facilitates hardware/software co-design. We discuss this topic in the next section.

5-3 Verification Issues for Future Works

In traditional synchronous design, verification is easier than in asynchronous design because you can verify your design against the clock; that is, you can check the state of the design cycle by cycle. Figure 5-1 shows a very simplified VLSI design flow. The design ideas are described in a cycle-based functional specification, which is traditionally written in the C programming language, so a cycle-based simulator can be used to prove the design ideas. The design is then implemented at the RTL/gate level. To verify the implementation, the two models are compared via cycle-by-cycle cross-verification. Finally, the design is transferred into layout, and cycle-based equivalence checking should be done between the RTL/gate-level design and the layout.

In contrast, without a global clock, each part of the design may work at its own speed, and it is not easy to determine whether the design is operating correctly at any specific time. Even worse, the operation time of the same component may vary depending on its input, especially in most DI/QDI designs. You can be very sure what state your design should be in at the 10th cycle, but how can you do the same for a system without a clock? Imagine a 2-phase bundled-data design: given a specific time, how can you determine what the state should be? As mentioned in section 2-2-2, in such systems each part of the design may begin to operate on either a rising or a falling edge of the request or acknowledge signal. Verification of different models of asynchronous circuits may also be a good research topic.
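The difference between the two handshake styles can be illustrated with a deliberately simplified model. The sketch below looks only at sampled levels of a request wire and ignores the acknowledge wire and the bundled data. The function names and the sample sequence are hypothetical and are not taken from the thesis or from the Balsa tools.

```python
def four_phase_transfers(req_levels):
    """Count transfers on a 4-phase (return-to-zero) request wire.

    Only a rising edge (0 -> 1) starts a transfer; the falling edge
    merely returns the wire to its idle level.
    """
    return sum(1 for prev, cur in zip(req_levels, req_levels[1:])
               if prev == 0 and cur == 1)


def two_phase_transfers(req_levels):
    """Count transfers on a 2-phase (transition-signaling) request wire.

    Every edge, rising or falling, starts a transfer.
    """
    return sum(1 for prev, cur in zip(req_levels, req_levels[1:])
               if prev != cur)


# Hypothetical sampled request wire, starting from idle (0).
req = [0, 1, 1, 0, 0, 1, 0]
rtz_count = four_phase_transfers(req)         # counts the two rising edges
transition_count = two_phase_transfers(req)   # counts all four edges
```

In the 2-phase case the level of the wire at a given instant says nothing about whether a transfer is in progress, which is one reason a time-based check is hard to define.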

We have already pointed out that many new issues must be dealt with carefully in developing embedded processors. Because most of these problems can be resolved with asynchronous circuits, we have put a lot of effort into developing asynchronous processors. In addition, new application requirements mean that these processors must support new features, and it is important to do some architectural exploration before these features are supported. Thus, we suggest a design flow that can be used to design new asynchronous embedded processors, from architectural exploration to functional verification. Figure 5-2 shows our new design flow. We introduce the use of an architecture description language (ADL) and select LISA as our design tool [110], not only because LISA is the most popular and successful ADL but also because it is a mixed structural and behavioral ADL. A design described in LISA can therefore be used to generate a simple toolchain including a (compiler), assembler, linker, and simulator. It can also be used to generate RTL in Verilog HDL. Thus, hardware/software co-design can be easily achieved.

CoWare® Inc. now provides a complete GUI-based IDE for LISA development called CoWare® Processor Designer [111], which makes LISA easy to learn and use. First, the design specification must be written manually as LISA descriptions. CoWare® Processor Designer can then be used to generate the toolchain and simulator. Note that, to achieve hardware/software co-design, the application software can be developed simultaneously. In addition, if the designed architecture is described in a structural model, the RTL can also be generated. Although the generated RTL model is not a very efficient implementation, it can still be used as a reference synchronous model for evaluation. Once the simulator and toolchain have been generated, the performance of the designed architecture can be roughly estimated. The generated simulator can then be used as a golden model for cross-verification with the newly designed asynchronous processor. However, because clock-by-clock cross-verification is impossible with asynchronous circuits, we suggest using “instruction-based” cross-verification; that is, we compare the execution results instruction by instruction. With this design flow, we can develop our new asynchronous embedded processor more effectively.
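Instruction-based cross-verification can be sketched as a comparison of two retirement traces, one from the golden simulator and one from the asynchronous design under test. This is a minimal sketch under stated assumptions: the trace format (a program counter paired with a register snapshot) and the function name are hypothetical and do not describe the output of CoWare® Processor Designer or of any Balsa simulation tool.

```python
from itertools import zip_longest


def first_mismatch(golden_trace, dut_trace):
    """Compare two retirement traces instruction by instruction.

    Each trace is an iterable of (pc, registers) records emitted when an
    instruction completes; no timestamps or cycle counts are involved.
    Returns None if the traces match, else (index, golden, dut), where a
    missing record in the shorter trace is reported as None.
    """
    pairs = zip_longest(golden_trace, dut_trace)
    for index, (golden, dut) in enumerate(pairs):
        if golden != dut:
            return index, golden, dut
    return None


# Hypothetical traces: the second instruction writes a wrong value to r2.
golden = [(0x00, {"r1": 5}), (0x04, {"r1": 5, "r2": 7})]
faulty = [(0x00, {"r1": 5}), (0x04, {"r1": 5, "r2": 9})]
mismatch = first_mismatch(golden, faulty)
```

Because the comparison is keyed on the order of completed instructions and not on time, it stays meaningful even when the two models run at entirely different speeds.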

Figure 5-1: Simple VLSI design flow

Figure 5-2: Our asynchronous processor design flow

References

[1] J. Liedtke, “Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces,” GMD Technical Report, No. 933, German National Research Center for Information Technology, Nov. 1995.

[2] A. Wiggins and G. Heiser, “Fast Address-Space Switching on the StrongARM SA-1100 Processor,” Technical Report, UNSW-CSE-TR-9906, The University of New South Wales, Australia, 1999.

[3] A. Wiggins and G. Heiser, “Fast Address-Space Switching on the StrongARM SA-1100 Processor,” in Proceedings of the 5th Australasian Computer Architecture Conference (ACAC), 2000, pp. 97-104.

[4] I. E. Sutherland and J. Ebergen, “Computers without Clocks,” Scientific American, August 2002, pp. 62-69.

[5] A. Davis and S. M. Nowick, “An Introduction to Asynchronous Circuit Design,” Technical Report, UUCS-97-013, Computer Science Department, University of Utah, Sep. 1997.

[6] S. Hauck, “Asynchronous design methodologies: an overview,” Proceedings of the IEEE, Vol. 83, Issue 1, Jan. 1995, pp. 69-93.

[7] A. Bink and Mark de Clercq, “ARM996HS Synthesizable CPU with Clockless Technology,” Information Quarterly, Vol. 5, No. 4, 2006, pp. 20-24.

[8] Neil H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design - A Systems Perspective, 2nd ed., Addison-Wesley Publishing Co., 1993.

[9] P. Gronowski, W. J. Bowhill, R. P. Preston, M. K. Gowan, R. L. Allmon, “High-Performance Microprocessor Design,” IEEE Journal of Solid-State Circuits, Vol. 33, No. 5, pp. 676-686, May 1998.

[10] D. R. Gonzales, “Micro-RISC Architecture for the Wireless Market,” IEEE Micro, Vol. 19, No. 4, pp. 30-37, July-August 1999.

[11] R. Y. Chen, N. Vijaykrishnan, M. J. Irwin, “Clock Power Issues in System-on-a-Chip Designs,” in Proceedings of the IEEE Computer Society Workshop on VLSI'99, 1999, pp. 48.

[12] D. Duarte, V. Narayanan, and M. J. Irwin, “Impact of Technology Scaling in the Clock System Power,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2002, pp. 59.

[13] T. Mudge, “Power: A First-Class Architectural Design Constraint,” IEEE Computer, Vol. 34, No. 4, pp. 52-58, April 2001.

[14] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, “Theoretical and practical limits of dynamic voltage scaling,” in Proceedings of the 41st annual Design Automation Conference, pp. 868-873, 2004.

[15] J. V. Woods, P. Day, S. B. Furber, J. D. Garside, N. C. Paver, and S. Temple, “AMULET1: an asynchronous ARM microprocessor,” IEEE Trans. Computers, Vol. 46, April 1997, pp. 385-398.

[16] S. B. Furber, J. D. Garside, S. Temple, J. Liu, P. Day, N. C. Paver, “AMULET2e: an asynchronous embedded controller,” in Third International Symposium on Advanced Research in Asynchronous Circuits and Systems, April 1997, pp. 290 – 299.

[17] S. B. Furber, J. D. Garside, P. Riocreux, S. Temple, P. Day, J. Liu, and N. C. Paver, “AMULET2e: An Asynchronous Embedded Controller,” Proceedings of the IEEE, Vol. 87, Issue 2, Feb. 1999, pp. 243-256.

[18] S. B. Furber, D. A. Edwards and J. D. Garside, “AMULET3: a 100 MIPS Asynchronous Embedded Processor”, in Proceedings of the International Conference on Computer Design, 2000, pp. 329-334.

[19] H. v. Gageldonk, D. Baumann, K. van Berkel, D. Gloor, A. Peeters, and G. Stegmann, “An asynchronous low-power 80c51 microcontroller,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 96-107, 1998.

[20] A. Bink and R. York, “ARM996HS: The First Licensable, Clockless 32-bit, Processor Core,” IEEE Micro, Vol. 27, Issue 2, pp. 58-68, March-April, 2007.

[21] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada, Ma. Ratta, S. Kottapalli, “A 45nm 8-Core Enterprise Xeon® Processor,” in IEEE International Solid-State Circuits Conference, ISSCC, 2009.

[22] A. Silberschatz, P. Galvin, G. Gagne, Operating System Concepts, 7th ed., John Wiley & Sons, 2005.

[23] B. Jacob and T. Mudge, “Virtual Memory: Issues of Implementation,” IEEE Computer,

Simulation and Measurement,” ACM Trans. on Computer Systems. Vol. 3, 1985, pp. 31-62.

[27] Michael J. Flynn, Computer Architecture – Pipelined and Parallel Processor Design, Jones and Bartlett Publishers, Boston, 1995.

[28] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann, 2006.

[29] J. P. Shen, M. H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill Professional, 2004.

[30] Advanced Micro Devices, Inc., AMD64 Technology - AMD64 Architecture Programmer’s Manual Volume 2: System Programming, AMD, September 2006.

[31] Intel Corp., Intel® 64 and IA-32 Architectures: Software Developer’s Manual Vol. 3A: System Programming Guide Part 1, Intel Corp., March 2009.

[32] Intel Corp., TLBs, Paging-Structure Caches, and Their Invalidation: Application Note, Intel Corp., 2008.

[33] M. Talluri, M. D. Hill, Y. A. Khalidi, “A new page table for 64-bit address spaces,” ACM SIGOPS Operating Systems Review, Vol. 29, Issue 5, Dec. 1995, pp. 184-200.

[34] MIPS Technologies, Inc., MIPS R10000 Microprocessor User’s Manual, Ver. 2.0,

[37] Todd Austin, SimpleScalar LLC, http://www.simplescalar.com/ (2009/06)

[38] Advanced Micro Devices, Inc., Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, AMD, September 2003.

[39] J. M. Tendler, J. S. Dodson, J. S. Fields Jr., H. Le, B. Sinharoy, “POWER4 System Microarchitecture,” IBM Journal of Research and Development, Vol. 46, 2002.

[40] Intel® Corp., Pentium® Pro Family Developer’s Manual Vol. 3: Operating System Writer’s Guide, Intel Corp., Dec. 1995.

[41] Intel Corp., Intel® Itanium® Architecture Software Developer’s Manual Vol. 2: System Architecture Rev. 2.2, Intel Corp., Jan. 2006.

[42] Advanced Micro Devices, Inc., AMD64 Architecture Programmer’s Manual Vol. 2: System Programming Rev. 3.14, Advanced Micro Devices, Inc., Sep. 2007.

[43] M. Talluri, Use of Superpages and Subblocking in the Address Translation Hierarchy, Ph.D. thesis, Dept. of CS, University of Wisconsin at Madison, 1995.

[44] M. Talluri and M. Hill, “Surpassing the TLB Performance of Superpages with Less Operating System Support,” in Proceedings of the Sixth Int’l Conference on Architectural Support for Programming Languages and Operating Systems, 1994, pp. 171-182.

[45] M. Talluri, Shing Kong, Mark D. Hill, and David A. Patterson. “Tradeoffs in Supporting Two Page Sizes,” In Proceedings of the 19th Annual Int’l Symp. on Computer Architecture, May 1992, pp.415-424.

[46] T. H. Romer, W. H. Ohlrich, A. R. Karlin, and B. N. Bershad, “Reducing TLB and Memory Overhead Using Online Superpage Promotion,” in Proceedings of the 22nd Annual Int’l Symp. on Computer Architecture, 1995, pp.176-187.

[47] Jung-Hoon Lee, Jang-Soo Lee, and Shin-Dug Kim, “A dynamic TLB management structure to support different page sizes,” Proceedings of the Second IEEE Asia-Pacific Conference on ASICs, 2000, pp. 299-302.

[48] Jung-Hoon Lee, Jang-Soo Lee, She-Woong Jeong, and Shin-Dug Kim, “A Banked-Promotion TLB For High Performance and Low Power,” Proceedings of the 2001 International Conference on Computer Design, 2001, pp. 118-123.

[49] M. Swanson, L. Stoller, and J. Carter, “Increasing TLB Reach Using Superpages Backed by Shadow Memory,” Proceedings of the 25th Annual International Symposium On Computer Architecture, 1998, pp. 204-213.

[50] Zhen Fang, Lixin Zhang, John B. Carter, Wilson C. Hsieh, and Sally A. McKee, “Reevaluating Online Superpage Promotion with Hardware Support,” in Proceedings of the 7th Int’l Symp. on High-Performance Computer Architecture, 2001, pp. 63-72.

[51] C. H. Park, J. Chung, B. H. Seong, Y. Roh, and D. Park, “Boosting Superpage Utilization with the Shadow Memory and the Partial-Subblock TLB,” in Proceedings of the 14th international conference on Supercomputing, 2000, pp. 187-195.

[52] David Channon and David Koch, “Performance Analysis of Re-configurable Partitioned TLBs,” Proceedings of the 30th Hawaii International Conference on System Sciences, 1995, Vol. 5, pp.168-177.

[53] T. Juan, T. Lang, J. J. Navarro, “Reducing TLB power requirements,” in International Symposium on Low Power Electronics and Design, 1997, pp. 196-201.

[54] Y. Lee, T. Lee, S. An, and Y. Lee, “Indirectly-compared cache tag memory using a share tag in a TLB,” IEE Electronics Letters, Vol. 33, No. 21, 1997, pp. 1764-1766.

[55] Y. Lee, T. Lee, S. An, and Y. Lee, “Shared tag for MMU and cache memory,” in International Semiconductor Conference, CAS'97, Vol. 1, Oct. 1997, pp. 77-80.

[56] A. Saulsbury, F. Dahlgren, and P. Stenstrom, “Recency-Based TLB Preloading,” Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 117-127.

[57] G. B. Kandiraju and A. Sivasubramaniam, “Going Distance for TLB Prefetching: An Application-driven Study,” in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002.

[58] W. A. Clark, “Macromodular computer systems,” in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS Joint Computer Conferences), April 18-20, 1967, pp. 335-336.

[59] A. J. Martin, “The limitations to delay-insensitivity in asynchronous circuits,” in Proceedings of the Sixth MIT Conference on Advanced Research in VLSI, 1990, pp. 263-278.

[60] A. J. Martin, Programming in VLSI: from communicating processes to delay-insensitive circuits, University Of Texas At Austin Year Of Programming Series, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1991.

[61] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in Proceedings of an International Symposium on the Theory of Switching, The Annals of the Computation Laboratory of Harvard University, Vol. 29, Part I, Harvard University Press, Cambridge, 1959, pp. 204-243.

[62] J. B. Dennis and S. S. Patil, “Speed Independent Asynchronous Circuits,” in Proceedings of the Fourth Hawaii International Conference on System Sciences, 1971, pp. 55-58.

[63] D. Misunas, “Petri nets and Speed Independent Design,” Communications of the ACM, Vol. 16, Issue 8, August 1973, pp.474-481.

[64] C. L. Seitz, “System Timing,” in Introduction to VLSI Systems, Ch. 7, Addison-Wesley Pub. Co., 1980.

[65] Y. T. Chang, M. C. Huang, W. M. Cheng, H. Y. Tsai, C. J. Chen, F. C. Cheng, “Self-Timed Torus Network with 1-of-5 Encoding,” in Proceedings of the 13th IEEE International Symposium on Consumer Electronics, Kyoto, Japan, May 25-28, 2009.

[66] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers, London, 2001, pp. 11-25.

[67] Chris J. Myers, Asynchronous Circuit Design, John Wiley & Sons, Inc., 2003.

[68] D. Muller and W. Bartky, “A theory of asynchronous circuits,” in Proceedings of an International Symposium on the Theory of Switching, April 1959, pp. 204-243.

[69] J. Gunawardena, “A generalized event structure for the Muller unfolding of a safe net,” in Proceedings of the 4th International Conference on Concurrency Theory, June 1993, pp. 278-292.

[70] C. J. Chen, W. M. Cheng, H. Y. Tsai, and J. C. Wu, “A Quasi-Delay-Insensitive Microprocessor Core Implementation for Microcontrollers,” Journal of Information Science and Engineering, Vol. 25, No. 2, March 2009, pp. 543-557.

[71] I.E. Sutherland, “Micropipelines,” Turing Award Lecture, Communications of the ACM, Vol.32, Number 6, June 1989, pp 720-738.

[72] E. Brunvand, “The NSR Processor,” in Proceeding of the 26th Hawaii International
