Discussion - Bus and Memory Design Exploration for Embedded Systems-on-Chip (嵌入式系統晶片之匯流排與記憶體設計探索)

In this section, we discuss the experimental results. Section 4.3.1 shows how much improvement our TGs deliver. Both TGs maintain their speedup ratio as the number of cores increases. However, the absolute simulation speed drops quickly with core count. This follows from the nature of SystemC modeling: SystemC simulation is event-driven, so as the number of simulated behaviors grows, the simulation time grows quickly as well. The simulation speeds of TG-1 and TG-2 move closer together as more cores are simulated, because the transaction behavior in the system becomes more complicated. With more masters on a platform, memory and bus conflicts become more frequent, and transaction behavior takes a larger share of the simulation time than in the single-core case, especially for TG-2. However, we chose the AXI crossbar interconnect hierarchy in this experiment, so interconnect behavior simulation does not dominate the full simulation time. The profiling result shows that the core’s “inside” behavior, namely cache and file access, accounts for more than half of the simulation time in the TG-1 simulation. However, this result will change for different benchmarks. In conclusion, TG-1 spends about half of its simulation effort on modeling cache behavior, while TG-2 spends almost 100% of its effort on the BIU, since TG-2 has simplified away all of the cores’ internal behaviors.
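The link between event count and simulation time can be illustrated with a toy event-driven kernel. This is only a sketch under stated assumptions, not the SystemC kernel or our TG model: the function name `simulate`, the single shared memory, and the `think_time` and `service_time` parameters are all hypothetical. It shows that the number of events the kernel must process grows with the number of masters, and that conflicts on the shared resource stretch the simulated time.

```python
import heapq

def simulate(num_masters, requests_per_master=100, think_time=10, service_time=4):
    """Toy event-driven kernel (hypothetical model, not SystemC).

    Each master repeatedly waits think_time cycles, then issues one request
    to a single shared memory that serves one request at a time.
    Returns (events_processed, final_simulated_time).
    """
    queue = []  # entries are (time, sequence number, kind, master id)
    seq = 0
    for m in range(num_masters):
        heapq.heappush(queue, (think_time, seq, "issue", m))
        seq += 1
    remaining = [requests_per_master] * num_masters
    mem_free_at = 0  # time at which the shared memory becomes idle
    events = 0
    now = 0
    while queue:
        now, _, kind, m = heapq.heappop(queue)
        events += 1
        if kind == "issue":
            # A conflict delays the request until the memory is free.
            start = max(now, mem_free_at)
            mem_free_at = start + service_time
            heapq.heappush(queue, (mem_free_at, seq, "done", m))
            seq += 1
        else:  # "done": the request has completed
            remaining[m] -= 1
            if remaining[m] > 0:
                heapq.heappush(queue, (now + think_time, seq, "issue", m))
                seq += 1
    return events, now

if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        print(cores, simulate(cores))
```

In this model every request costs the kernel two events, so host work scales linearly with the number of masters even before conflicts are considered.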

Besides, the profiling results shown here were obtained under the default simulation settings, which perform only cycle-count analysis. If the analysis capability inside the TG model is turned on, the simulation speed and the profiling results will change, and considerable effort will be spent on this functionality.

The off-line traffic generation shows that the ISS simulation takes a long time. Fortunately, only one ISS simulation is needed per source code. Since exploration requires many repeated simulations, this one-time effort becomes less significant. TG-2 needs to re-simulate the cache for each cache configuration; however, that simulation time is still short. If the design choices are few, the simulation time of the cache model is acceptable. The formulas below give the total simulation time for our TGs and for the traditional ISS.

Total simulation time (ISS) = M × Full simulation time

Total simulation time (TG-1) = (ISS time + File translation time) + M × Full simulation time

Total simulation time (TG-2) = (ISS time + N × (File translation time + Cache time)) + M × Full simulation time

N is the total number of cache design choices to explore, and M is the number of full-system simulations. The average overhead of off-line simulation becomes smaller as the number of full-system simulations increases. On the other hand, the traffic file size is also a serious problem. The traffic file of TG-1 can be extremely large for a large application. With more cores on a platform, TG-1’s traffic file becomes a critical overhead for runtime simulation. In conclusion, TG-1 is suited to a large design space because it does not need to re-simulate. TG-2 is suited to a smaller cache design space because it needs to re-simulate for each cache configuration. TG-2 is also suited to large application benchmarks because of its smaller traffic file overhead.
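The three totals can be compared numerically with a small cost model. The sketch below is illustrative only: the function names are hypothetical, the per-run full simulation time is passed separately for each method (an assumption, since the formulas use one symbol for all three), and the numbers in the example are invented placeholders rather than measurements. The hypothetical helper `break_even_runs` returns the smallest M at which a TG flow beats the ISS flow.

```python
import math

def total_time_iss(m, full_iss):
    """M full-system runs, each executing the ISS."""
    return m * full_iss

def total_time_tg1(m, iss, translate, full_tg1):
    """One off-line ISS run and one file translation, then M TG-1 runs."""
    return (iss + translate) + m * full_tg1

def total_time_tg2(m, n, iss, translate, cache, full_tg2):
    """One off-line ISS run, N translation + cache passes, then M TG-2 runs."""
    return (iss + n * (translate + cache)) + m * full_tg2

def break_even_runs(offline, full_iss, full_tg):
    """Smallest M with offline + M * full_tg < M * full_iss, or None."""
    saved_per_run = full_iss - full_tg
    if saved_per_run <= 0:
        return None  # the TG run is not faster, so it never pays off
    return math.floor(offline / saved_per_run) + 1

if __name__ == "__main__":
    # Placeholder values in arbitrary time units (assumptions, not measured).
    M, N = 10, 3
    print(total_time_iss(M, full_iss=600))                           # 6000
    print(total_time_tg1(M, iss=600, translate=20, full_tg1=150))    # 2120
    print(total_time_tg2(M, N, iss=600, translate=20, cache=30,
                         full_tg2=100))                              # 1750
    print(break_even_runs(offline=620, full_iss=600, full_tg=150))   # 2
```

Dividing the off-line term by M gives the average off-line overhead per run, which shrinks as M grows, matching the observation above.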

5 CONCLUSIONS

This thesis first addresses SoC design exploration issues and focuses on simulation-based exploration methodology. We then study a successful simulation framework, MPARM [20][21], and describe how the environment is built with the SystemC [15] language. This case shows that a full simulation environment is useful for designers to analyze the performance of different hierarchies. However, the simulation speed is too slow for modern multicore SoC design space exploration. The problem persists when we rebuild the simulation environment in a modern ESL tool [17]; the experiment shows that it is still not fast enough. Many previous works focus on speeding up simulation. Transaction Level Modeling [16][29] does help exploration by raising the modeling abstraction level, but it sacrifices simulation precision. TLM-based simulation helps to speed up interconnection behavior modeling but does not improve the modeling of the processors’ internal computation. A Traffic Generator can completely simplify the modeling of a processor’s computation. Nevertheless, TG-based simulation usually does not reflect real application behavior, or the TG directly replays the traffic of a previous simulation. These two methods both have their own drawbacks.

We proposed a TG-based exploration acceleration approach to deal with these problems. Our TGs combine the properties of both TLM and traditional TGs in our framework. They support multiple on-chip bus protocols, multiple abstraction levels, and cache behavior simulation. Most importantly, our TGs’ transaction behavior is based on a real application rather than on statistical traffic results. Also, our TGs do not need a full-system simulation to record traffic.

The proposed simulation flow is separated into two phases: off-line traffic generation and ESL simulation. The TG-1 solution simulates the cores’ ISS behavior off-line and keeps the cache model in the ESL simulation environment. The TG-2 solution simulates both the cores’ ISS behavior and the caches’ behavior off-line, which completely simplifies the TGs’ modeling in the ESL simulation environment.

We supply a tool chain for the full simulation framework and define a traffic format that is used by both TG solutions.
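The difference between the two off-line phases can be sketched as a trace filter. Everything in this example is an assumption made for illustration: the record layout `(op, address)` is not our actual traffic format, and the direct-mapped, write-through, no-write-allocate cache in the hypothetical function `filter_through_cache` is just one simple policy. A TG-1 style flow would replay the core-side trace into a cache model inside the ESL environment, whereas a TG-2 style flow would apply a filter like this off-line and replay only the resulting bus-side traffic.

```python
from typing import List, Tuple

# Hypothetical trace record: ("R" or "W", byte address).
Access = Tuple[str, int]

def filter_through_cache(trace: List[Access],
                         num_lines: int = 4,
                         line_bytes: int = 16) -> List[Access]:
    """Return the bus-side traffic left after a direct-mapped cache.

    Assumed policy: a read hit causes no bus traffic, a read miss causes one
    line fill at the line base address, and every write is forwarded
    unchanged (write-through, no write-allocate).
    """
    tags = [None] * num_lines  # tag stored in each cache line, None if empty
    bus: List[Access] = []
    for op, addr in trace:
        line = addr // line_bytes
        index = line % num_lines
        tag = line // num_lines
        if op == "R":
            if tags[index] != tag:      # read miss: fetch the whole line
                tags[index] = tag
                bus.append(("R", line * line_bytes))
        else:                           # write: always goes to the bus
            bus.append(("W", addr))
    return bus

if __name__ == "__main__":
    core_trace = [("R", 0x00), ("R", 0x04), ("W", 0x08), ("R", 0x40), ("R", 0x00)]
    print(filter_through_cache(core_trace))
```

Because the filter depends on `num_lines` and `line_bytes`, the bus-side trace has to be regenerated for each cache configuration, while the core-side trace does not.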

We further verify our TGs’ accuracy against the ARM ISS model: our TGs achieve at least 90% accuracy. We then set up an experiment to measure simulation speed. The experiment shows that our proposed TGs do speed up simulation: TG-1 is about 4 times faster than the ARM ISS, and TG-2 is about 6 times faster. This shows that our exploration framework can be used for SoC designs whose target processor has already been decided. The simulation profiling shows that TG-1 is suited to a large design space, especially one that focuses on co-exploration of cache organization and interconnection network. TG-2 is suited to a design space that focuses on interconnection network exploration with fewer cache design choices.

Our future work is to enhance the modeling capabilities, including a semaphore interface between TGs to support multicore issues, cache models that handle multiprocessor data coherence, and support for multi-level cache hierarchies. Moreover, the simulation speed could be improved by traffic file compression techniques that lower system overhead.

REFERENCES

[1] ITRS Roadmap 2007. [Online]. Available: http://www.itrs.net/Links/2007ITRS/Home2007.htm

[2] M. Keating and P. Bricaud, Reuse Methodology Manual for System-On-A-Chip Designs, 3rd Edition, Kluwer Academic Publishers, 1996.

[3] R. Kumar, V. Zyuban, and D. M. Tullsen, “Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling,” in Proc. ISCA, 2005.

[4] T. T. Ye, L. Benini, and G. D. Micheli, “Packetized on-chip interconnect communication analysis for MPSoC,” in Proc. DATE, 2003.

[5] M. Ruggiero, R. Angiolini, F. Poletti, D. Bertozzi, L. Benini, and R. Zafalon, “Scalability analysis of evolving SoC interconnect protocols,” Int. Symposium on System-on-Chip, 2004.

[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publishers, 1996.

[7] Y. S. Cho, E. J. Choi, and K. R. Cho, “Modeling and analysis of the system bus latency on the SoC platform,” in Proc. SLIP, 2006.

[8] A. Muttreja, A. Raghunathan, S. Ravi, and N. K. Jha, “Automated energy/performance modeling of embedded software,” in Proc. DAC, 2004.

[9] W. T. Shiue and C. Chakrabarti, “Memory design and exploration for low power, embedded systems,” Journal of VLSI Signal Processing, vol. 29, 2001.

[10] T. D. Givargis, F. Vahid, and J. Henkel, “Fast cache and bus power estimation for parameterized system-on-a-chip design,” in Proc. DATE, 2000.

[11] A. Asaduzzaman, I. Mahgoub, H. Kalva, R. Shankar, and B. Furht, “Cache optimization for mobile devices running multimedia applications,” in Proc. ISMSE, 2004.

[13] E. Ipek, S. A. McKee, K. Singh, R. Caruana, B. R. Supinski, and M. Schulz, “Efficient architectural design space exploration via predictive modeling,” ACM Transactions on Architecture and Code Optimization, vol. 4, 2008.

[14] T. Givargis, F. Vahid, and J. Henkel, “System-level exploration for Pareto-optimal configurations in parameterized System-on-a-Chip,” IEEE Transactions on VLSI Systems, vol. 10, 2002.

[15] T. Grötker, S. Liao, G. Martin, and S. Swan, System Design with SystemC, Kluwer Academic Publishers, 2002.

[16] L. Cai and D. Gajski, “Transaction level modeling: an overview,” in Proc. CODES+ISSS, 2003.

[17] CoWare Inc. Platform Architect. [Online]. Available: http://www.coware.com/products/

[18] ARM Ltd. RealView MaxSim. [Online]. Available: http://www.arm.com/products/DevTools/MaxSim.html

[19] Synopsys Inc. SystemStudio. [Online]. Available: http://www.synopsys.org

[20] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri, “MPARM: exploring the multi-processor SoC design space with SystemC,” Journal of VLSI Signal Processing Systems, vol. 41, 2005.

[21] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon, “Analyzing on-chip communication in a MPSoC environment,” in Proc. DATE, 2004.

[22] IEEE 1666 SystemC Language Reference Manual. [Online]. Available: http://www.systemc.org/

[23] Open Core Protocol International Partnership (OCP-IP). [Online]. Available: http://www.ocpip.org/home

[24] S. Boukhechem, E. Bourennane, and H. Samahi, “Co-simulation platform based on SystemC for multiprocessor system on chip architecture exploration,” in Proc. ICM, 2007.

[25] Wishbone bus. [Online]. Available: http://www.opencores.org/

[26] R. B. Atitallah, S. Niar, S. Meftali, and J. L. Dekeyser, “An MPSoC performance estimation framework using transaction level modeling,” in Proc. RTCSA, 2007.

[27] A. Donlin, “Transaction level: flows and use models,” in Proc. CODES+ISSS, 2003.

[28] S. Pasricha, N. Dutt, and M. B. Romdhane, “Fast exploration of bus-based communication architectures at the CCATB abstraction,” ACM Transactions on Embedded Computing Systems, vol. 7, 2008.

[29] S. Pasricha, “Transaction level modeling of SoC with SystemC 2.0,” in Synopsys User Group Conference, 2002.

[30] G. Strano, C. Pistritto, L. Benini, G. Strano, and C. Pistritto, “Capturing the interaction of the communication, memory and I/O subsystems in memory-centric industrial MPSoC platforms,” in Proc. DATE, 2007.

[31] T. Risset, A. Fraboulet, and A. Scherrer, “Automatic phase detection for stochastic on-chip traffic generation,” in Proc. CODES+ISSS, 2006.

[32] Soclib simulation environment. [Online]. Available: https://www.soclib.fr/trac/dev

[33] S. Mahadevan, F. Angiolini, M. Storgaard, R. G. Olsen, J. Sparso, and J. Madsen, “A network traffic generator model for fast Network-on-Chip simulation,” in Proc. DATE, 2005.

[34] ARM Ltd. RVDS. [Online]. Available: http://www.arm.com/products/DevTools/RealViewDevSuite.html

[35] Dinero IV. [Online]. Available: http://pages.cs.wisc.edu/~markhill/DineroIV/

[36] MSCSim. [Online]. Available: http://www.mscsim.com/

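Author Biography

顏于凱 was born in Kaohsiung County on June 30, 1984. The author received a bachelor's degree from the Department of Electronics Engineering, National Chiao Tung University, in 2006, and continued at the Institute of Electronics, National Chiao Tung University, to pursue a master's degree. In 2008, the author received the master's degree under the supervision of Professor 劉志尉. This thesis, 「嵌入式系統晶片之匯流排與記憶體設計探索」 (Bus and Memory Design Exploration for Embedded Systems-on-Chip), is the author's master's thesis.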
