On-Demand Memory System for Wireless Video Entertainment Systems

National Chiao Tung University

Department of Electronics Engineering & Institute of Electronics

Master's Thesis

Student: Yung Chang

Advisor: Prof. Wei Hwang

A Thesis

Submitted to the Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electronics Engineering

July 2010

Hsinchu, Taiwan, Republic of China

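On-Demand Memory System for Wireless Video Entertainment Systems

Student: Yung Chang

Advisor: Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract (Chinese)

As demand for ubiquitous, high-data-rate wireless multimedia services grows year by year, future requirements can only be met by moving toward platforms that merge multiple cores, multiple tasks, and multiple systems. A multi-core platform, however, needs a memory system that supplies sufficient data bandwidth together with a sound memory management mechanism. In this thesis, we propose a high-performance, low-power on-demand memory system suited to multi-core platforms and apply it to a wireless video entertainment system. The proposed on-demand memory system consists mainly of distributed memory managers and a centralized memory manager. In the distributed memory manager, we propose a borrowing mechanism that dynamically allocates memory resources to buffer network-on-chip packets, which reduces processing-unit stalls. In the centralized memory manager, the proposed adaptive cache control mechanism allocates different memory resources according to the memory access characteristics of the different processing units. The centralized memory manager also includes an external memory access interface for accessing off-chip memory efficiently. In addition, for the Scalable Video Coding (SVC) used in the wireless video entertainment system, we propose a data pre-fetch mechanism and an efficient DRAM data arrangement mechanism to reduce the cache miss rate and the DRAM energy consumption. Together with the adaptive cache control in the centralized memory, these mechanisms allow the system to reach the best memory utilization.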

On-Demand Memory System for Wireless Video

Entertainment Systems

Student : Yung Chang

Advisor : Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics

National Chiao-Tung University

ABSTRACT

With increasing demand for ubiquitous wireless high-data-rate multimedia services, efficient processing capability and a merged multi-task system are critical to sustaining the growth. Such a system needs a well-organized memory system that provides enough bandwidth and optimizes memory management. In this thesis, an on-demand memory system is presented to overcome the challenges of multi-task, heterogeneous multi-core system design. The proposed on-demand memory system, consisting of distributed and centralized memory management units (MMUs), provides energy-efficient, memory-centric on-chip data communication for wireless video entertainment systems.

Distributed MMUs (d-MMUs) use the proposed borrowing mechanism to dynamically allocate memory resources for network data buffering, which reduces processor element stalls. The centralized MMU (c-MMU) manages the centralized on-chip memories (L2 cache) and the off-chip memories. Because the processor elements in the system have different memory requirements, memory resources are allocated adaptively through the proposed adaptive cache control. In addition, an external memory interface is designed in the c-MMU to access off-chip DRAM efficiently. Based on the characteristics of wireless video data, an inter-layer pre-fetch mechanism and an efficient data allocation scheme are proposed to reduce the cache miss rate and the memory energy consumption of Scalable Video Coding (SVC).

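Acknowledgements

I would like to thank my advisor, Prof. Wei Hwang, for the guidance and encouragement given to me over the past two years. The many directions and pointers offered during my research allowed it to be completed smoothly. I am especially grateful for the chance to learn about memory systems, multimedia, and system integration at the same time, which made these two years of research demanding but full of challenge and enjoyment. I also owe special thanks to the professors, senior students, and classmates on my team, in particular 黃柏蒼 and 王湘斐 of our laboratory, for their collaboration and guidance during this research. I thank the eHomeII project team, Prof. Wei Hwang, Prof. 黃經堯, Prof. 許騰尹, Prof. 張錫嘉, Prof. 張添烜, Prof. 闕河鳴, Prof. 劉志尉, and Prof. 桑梓賢, for their advice, which gave me the opportunity and experience of system integration. While working in the team I also interacted with students from each sub-project; I thank them for their cooperation and advice and for suggestions from many different directions. Special thanks go to 李國龍, a doctoral student, and 陳宥宸, both of the same project team, for their support and assistance in our joint research. Next, I thank 張銘宏, 謝維致, and 楊皓義, senior members of the laboratory, and my classmates 邱議德, 謝忠穎, 陳璽文, and 林天鴻, who helped and taught me a great deal during my research and gave me much valuable advice. Finally, I thank my family and friends for their encouragement and care, which allowed my research to be completed smoothly.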

Chapter 1 Introduction

1.1 Motivation

1.2 Contributions

1.3 Organization

Chapter 2 Related Researches of Memory Systems

2.1 Memory hierarchy

2.2 Cache

2.2.1 An overview of Cache Memory

2.2.2 Reconfigurable Cache Techniques and Improvements

2.3 DRAM

2.3.1 DRAM characteristic

2.3.2 DRAM controller techniques and Improvements

2.3.3 Modern DRAM Development

Chapter 3 Memory-Centric On-chip Data Communication Platform for Wireless Video Entertainment Systems

3.1 Motivations

3.2 Memory-Centric On-chip Data Communication Platform

3.2.1 Overall Architecture

3.2.2 Concepts of On-Demand Memory System

3.3 Wireless Video Entertainment Systems

3.3.1 Wireless Processing Unit (WPU)

3.3.2 Medium Access Control (MAC)

3.3.3 LT Coding

3.3.4 Scalable Video Coding (SVC)

3.4 Memory-Centric On-Chip Data Communication Platform for Wireless Video Entertainment Systems

Chapter 4 Hierarchy Memory Management Units for On-Demand Memory System

4.1 Distributed Memory Management Unit Organization

4.1.1 Design of d-MMU

4.2 Centralized Memory Management Unit Organization

4.2.1 Design of c-MMU

4.2.2 External Memory Interface in DRAM controller

4.2.3 Simulation Results of the Adaptive Cache

4.3 Summary

Chapter 5 … Entertainment Systems

5.1 Data Pre-fetch for SVC

5.1.1 Introduction

5.1.2 Inter-layer prediction of the SVC

5.1.3 Proposed Inter-layer Pre-fetch Scheme

5.2 Address Translator for SVC

5.2.1 Introduction

5.2.2 Centralized MMU with Address Translator

5.2.3 Data Arrangement

5.3 Analysis & Simulation Results

5.3.1 Improvement of adding IPS

5.3.2 Improvement of adding Address Translator

5.3.3 Analysis and Simulation Results of Adaptive Cache Control for Wireless Video Entertainment Systems

5.4 Summary

Chapter 6 Conclusions and Future Work

6.1 Conclusions

6.2 Future Work

Bibliography

References of Chapter 1

References of Chapter 2

References of Chapter 3

References of Chapter 4

References of Chapter 5

References of Chapter 6

Vita

List of Figures

Fig.2. 1 Memory hierarchy

Fig.2. 2 A simple cache memory.

Fig.2. 3 A four-block cache configured as direct mapped, two-way set associative, and fully associative.

Fig.2. 4 Associativity-based partitioning organization for reconfigurable caches

Fig.2. 5 Overlapped wide-tag partitioning organization for reconfigurable caches

Fig.2. 6 A selective-ways organization.

Fig.2. 8 Tiles - A physical organization of molecules.

Fig.2. 9 Different steps in cache access in the molecular cache

Fig.2. 10 An example of typical CMP cache partitioning

Fig.2. 11 Basic structure of the reconfigurable Amorphous Cache for processors with large on-chip cache memories

Fig.2. 12 Simplified architecture of a DRAM.

Fig.2. 13 Bank state diagram.

Fig.2. 14 DDR3 Read command [2.12].

Fig.2. 15 DDR3 Write command [2.12]

Fig.2. 16 State machine for storing page hit history information.

Fig.2. 17 Interlaced method

Fig.2. 19 Configurations of different layers of the proposed memory controller

Fig.2. 20 DRAM roadmap

Fig.2. 21 CPU vs. DRAM performance

Fig.2. 22 Accesses addressed to same bank

Fig.2. 23 Accesses addressed to different bank

Fig.3. 1 Wireless Video Entertainment Systems

Fig.3. 2 Homogeneous multi-core platform (a) Intel Polaris (b) Tilera TILEPro64™ Processor

Fig.3. 3 Trend of the data transmitting bandwidth

Fig.3. 4 Comparison between memory bandwidth, memory capacity and communication efficiency in multi-core systems

Fig.3. 5 The architecture of memory-centric on-chip data communication platform

Fig.3. 6 Illustration of the memory hierarchy in on-demand memory system

Fig.3. 7 Multi-Task wireless video entertainment system

Fig.3. 8 Transmitter and receiver block diagram

Fig.3. 9 Single-FFT Architecture for MIMO Modem

Fig.3. 10 Single-FFT Architecture for MIMO Modem

Fig.3. 11 Single-FFT Architecture for MIMO Modem

Fig.3. 12 MAC Layer Architecture

Fig.3. 13 An example of decidable codewords which BP decoding fails to decode

Fig.3. 14 Architecture of an SVC encoder

Fig.4. 1 Block diagram of a local node

Fig.4. 2 Illustration of read operation

Fig.4. 3 Illustration of write operation

Fig.4. 4 Illustration of hiding miss penalty

Fig.4. 5 d-MMU and efficient Network Interface

Fig.4. 6 Buffer borrowing interface between NI and d-MMU

Fig.4. 8 Architecture of the empty memory block searching

Fig.4. 9 Searching flow chart of the borrowing mechanism in d-MMU

Fig.4. 10 Block diagrams of borrowing mechanism in network interface

Fig.4. 11 Borrowing control policy of the buffering control

Fig.4. 12 (a) Execution time under various injection loads and queue sizes (b) Transferred packets under various injection loads and queue sizes.

Fig.4. 13 c-MMU block diagram

Fig.4. 15 Illustration of the memory partition

Fig.4. 19 Detail architecture of c-MMU

Fig.4. 20 Connection of EMI

Fig.4. 21 Architecture of EMI

Fig.4. 22 State diagram of EMI Finite State Machines

Fig.4. 23 bank-miss scheduling

Fig.4. 24 read / write scheduling

Fig.4. 25 row-conflict scheduling

Table.4. 5 Simulation of the bandwidth utilization

Fig.4. 26 DRAM latency estimation for different situations

Fig.4. 28 System configuration interface of the System Power Calculator

Fig.4. 29 Summary of the power measurement result in the System Power Calculator

Fig.4. 32 Total memory energy consumption

Fig.5. 1 Illustration of inter-layer motion prediction [5.12]

Fig.5. 2 Illustration of inter-layer residual prediction [5.12]

Fig.5. 3 Illustration of inter-layer intra prediction [5.12]

Fig.5. 5 Illustration of the Inter-layer Pre-fetch Scheme

Fig.5. 6 d-MMU architecture with Pre-fetch Command Generator

Fig.5. 7 Centralized MMU architecture with Address translator

Fig.5. 8 Architecture of the DRAM organization

Fig.5. 9 Conventional mapping scheme for the selected DRAM

Fig.5. 10 Video frame arrangement of a GOP

Fig.5. 11 Frame map to memory

Fig.5. 12 Miss rate of the L1 cache versus L1 cache size

Fig.5. 13 L1 cache ways vs. Miss Rate

Fig.5. 14 Memory access count of L2 Cache

Fig.5. 15 DRAM access count

Fig.5. 16 L1 cache energy measurement

Fig.5. 17 DRAM row-miss rate

Fig.5. 19 DRAM activate power

Fig.5. 20 DRAM bandwidth utilization

Fig.5. 21 DRAM energy consumption

Fig.5. 22 Total Execution cycles

Fig.5. 23 On-chip cache energy consumption

Fig.5. 24 Total memory energy consumption

Fig.5. 25 Video coding performance [5.18]

Fig.5. 26 SVC memory requirements of different scalable layers for a GOP

Fig.5. 28 Memory energy consumption for different SVC levels

Fig.5. 29 Relation between simulation time interval and decoding SVC level

Fig.5. 30 Simulation result of total execution cycles

Fig.5. 31 Simulation result of memory energy consumption

List of Tables

Table.2. 1 Cost-performance for various memory technologies

Table.2. 2 Related work of adaptive caches

Table.2. 3 The maximum transfer rate for SDR, DDR, DDR2 and DDR3

Table.2. 4 Number of banks for SDR, DDR, DDR2 and DDR3

Table.2. 5 Supply voltages for DDR family

Table.4. 1 System Specification

Table.4. 3 Micron's DDR3 configurations

Table.4. 4 Common timing parameters of Micron DDR3 SDRAM

Table.4. 5 Simulation summary

Table.4. 7 Summary of system and DRAM Configuration

Table.4. 8 List of simulation information

Table.4. 9 Memory requirement assumption and corresponding bank assignment for c-MMU

Table.5. 1 Selected Micron DDR3 size parameters

Table.5. 2 Summary of SVC information

Table.5. 3 List of simulation information

Table.5. 4 c-MMU bank assignment for wireless video entertainment systems

Chapter 1 Introduction

1.1 Motivation

With the development of system-on-a-chip (SoC) and multimedia technologies, the amount of data and computation to be processed is increasing quickly. Multi-task processing is becoming more and more important for integrating various processor elements into a chip [1.1]-[1.3]. Most systems require memory for storage. In a multi-task environment, memory is the center of the storage system, and it is the most serious bottleneck because processor elements are much faster than memory. Accordingly, the organization of the memory system strongly affects the performance of a multi-task system.

In addition, multimedia technologies are usually applied in multi-task systems for video processing. These technologies have not only enabled existing applications such as desktop video/audio but also spawned new industries and services such as digital video recording, video on demand, high-definition TV, and digital home servers. Processing high-quality video or video with multiple scalable levels generally requires a large amount of memory. The memory system must provide enough memory space and high data bandwidth to satisfy the real-time requirements of video. A multilevel memory hierarchy is a well-known design methodology for meeting the large bandwidth requirement of a multi-task system. A well-organized memory hierarchy offers the fast access time of the highest-level memory and the low cost per bit of the lowest-level memory. In addition, data transfer to off-chip memory is especially important because off-chip bandwidth is a scarce resource. As many recent studies have shown, the off-chip memory system is one of the primary performance bottlenecks in current systems.

As the number of processor elements in an SoC increases, data communication and memory access traffic become more and more serious problems in constructing multi-task or multi-core systems, especially for systems with video processing requirements such as digital TVs, digital home servers, or mobile devices. In video processing, a large amount of data must be processed within a tightly bounded time. Higher video resolution requires more memory bandwidth to meet the real-time requirement. Furthermore, modern video coding schemes such as scalable video coding (SVC) [1.4] or multi-view coding [1.5] require more memory bandwidth than conventional coding schemes. Additionally, in a multi-task system, different processor elements may have quite different memory behavior. For instance, a video processor element requires a large memory, whereas a wireless processor element may not. Applying a traditional memory system to a multi-task platform therefore results in poor memory utilization.

How to manage and utilize memory is the most important issue in constructing a multi-task platform. Accordingly, large amounts of high-speed, low-power memory are indispensable for merging multiple tasks and multiple systems. These memories should support the diverse memory requirements of the different processor elements in a system. Therefore, a memory-centric on-chip data communication platform with an on-demand memory system is proposed in this thesis. The on-demand memory system provides high-bandwidth, low-power memory accesses for a multi-core platform through memory management units (MMUs). The MMUs can assign different memory resources to different processor elements according to their memory behavior. Moreover, a video decoder generally has regular memory access characteristics when decoding video frames, and this regular behavior can be exploited to improve decoding performance.

1.2 Contributions

In this thesis, a memory-centric on-chip data communication platform is presented for merging heterogeneous processor elements into one system, and it is applied to wireless video entertainment systems. In this platform, an on-demand memory system is constructed to allocate memory resources dynamically and manage memory accesses efficiently. The contributions of the on-demand memory system are as follows.

A. Buffer borrowing mechanism for data communication

To reduce the stalls caused by blocked network data, a novel buffer borrowing mechanism is proposed that borrows memory resources to buffer the blocked packets.

B. Adaptive cache control

In a multi-task system, different processor elements (PEs) may have different memory requirements at runtime. The proposed c-MMU supports memory resource re-allocation through an adaptive cache control scheme, which improves the memory utilization of the system.

C. External Memory Interface (EMI) for DDR3 DRAM

A modern DDR3 DRAM device is used to support large data storage. An efficient external memory interface for DDR3 DRAM is constructed in this work.

D. Inter-layer Pre-fetch Scheme (IPS) for SVC

In wireless video entertainment systems, SVC is used for video coding. IPS is proposed to reduce the miss rate when decoding SVC frames.

E. Efficient Address Translator (AT) for SVC

A suitable DRAM data allocation for frame data is presented. It improves DRAM access efficiency when processing SVC.

1.3 Organization

The organization of this thesis is as follows. Chapter 2 introduces related research on memory systems: the concept of memory hierarchy, previous work on reconfigurable caches, DRAM architecture, basic DRAM operation, DRAM controllers, and modern DRAM development.

Chapter 3 introduces the memory-centric on-chip data communication platform with an on-demand memory system for wireless video entertainment applications. The development of wireless video entertainment systems and the concept of the on-demand memory system are introduced.

Chapter 4 presents the design of the distributed and centralized memory management units (MMUs) applied in the memory-centric on-chip data communication platform. A buffer borrowing mechanism in the distributed MMUs and an adaptive cache scheme in the centralized MMU are proposed to optimize memory resource utilization dynamically in the on-demand memory system. An efficient external memory interface is presented for communicating with external memory. Chapter 4 also introduces the methods used to measure memory latency and energy.

Subsequently, Chapter 5 proposes a pre-fetch scheme and a DRAM data allocation scheme to improve the memory energy efficiency of the Scalable Video Coding (SVC) functional block in wireless video entertainment systems. A pre-fetch command generator and an address translator are added to the distributed MMU and the centralized MMU, respectively. With these schemes, the memory energy consumption of both the on-chip cache and the off-chip DRAM can be reduced significantly when decoding video frames with SVC. Finally, conclusions and future work are discussed in Chapter 6.

Chapter 2 Related Researches of Memory Systems

This chapter introduces related research on memory systems, including cache and DRAM systems, as well as previous work on reconfigurable caches and DRAM controllers. The concept of memory hierarchy is described in Section 2.1, followed by overviews of cache and DRAM systems in Sections 2.2 and 2.3, respectively.

2.1 Memory hierarchy

Fig.2. 1 Memory hierarchy

In computer and SoC systems, memory elements are necessary for data storage. The most important design concept is the memory hierarchy, because a well-organized hierarchy gives the memory system two advantages at once: the access time of the fastest memory and the cost per bit of the cheapest memory. The memory hierarchy is based on the principle of locality, both temporal and spatial. It is commonly drawn as a pyramid, as shown in Fig.2. 1 [2.1]. The higher levels perform better than the lower levels, but their cost per bit is also higher. Ideally, the processor element sees the access performance of the best memory together with a large memory space. Today the hierarchy is built from cache (SRAM), DRAM, and disk storage. Their performance and energy consumption are listed in Table.2.1. So far, no storage element provides low cost, high bandwidth, and low latency simultaneously. The memory hierarchy is built to hide the negative characteristics and exploit the positive characteristics of these memory technologies.
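The benefit of the hierarchy can be illustrated with the standard average-access-time recurrence, in which each level contributes its hit latency plus its miss rate times the cost of going one level down. The sketch below is not taken from the thesis: the function name and the latency and miss-rate values are hypothetical and serve only to show the arithmetic.

```python
def average_access_time(levels, backing_latency):
    """Average access time seen by the processor, in cycles.

    levels: list of (hit_latency, miss_rate) pairs ordered from the fastest
            level (closest to the processor) to the slowest cached level.
    backing_latency: latency of the last level, which is assumed to always hit.
    All numbers passed in are illustrative assumptions.
    """
    total = backing_latency
    # Fold from the bottom of the pyramid upwards:
    # time(level) = hit_latency + miss_rate * time(next lower level)
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# Hypothetical two-level cache in front of a 100-cycle memory.
amat = average_access_time([(1, 0.05), (10, 0.2)], backing_latency=100)
print(f"average access time: {amat:.2f} cycles")
```

With these assumed numbers the processor sees an average close to the speed of the fastest level even though most of the capacity sits in the slowest one, which is the effect the pyramid is meant to convey.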

Table.2.1 Cost-performance for various memory technologies

The design and configuration of the memory hierarchy differ according to system requirements. The following sections introduce previous work on adaptive cache design and external memory controllers.

2.2 Cache

2.2.1 An overview of Cache Memory

In the memory hierarchy, the cache plays an important role because it is the first level of the hierarchy. Its basic operation is illustrated in Fig.2. 2. Assume the processor element uses 32-bit addresses. The address is divided into three parts: offset, Index, and Tag. The Index selects a cache line, and the Tag stored in that line is then checked. If the Tag of the address equals the Tag bits recorded in the cache line and the valid bit is 1, the wanted data is in the cache (a hit) and is delivered. The valid bit indicates whether an entry contains a valid address. If the Tag differs or the valid bit is 0, the requested data is not in the cache and may be stored in a lower level of the memory. When the wanted data is found in the lower level, it is written into the cache and the Tag entry is updated.

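[Figure: a 32-bit address is split into a 20-bit Tag (bits 31 to 12), a 10-bit Index (bits 11 to 2), and a 2-bit offset (bits 1 to 0). The Index selects one of 1024 entries (0 to 1023), each holding Valid, Tag, and 32-bit Data fields; a comparison on the 20-bit Tag produces the hit signal alongside the data.]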

Fig.2. 2 A simple cache memory.
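The lookup described above can be sketched in a few lines. The bit widths follow Fig.2. 2 (20-bit Tag, 10-bit Index, 2-bit offset); the class and function names are hypothetical, and the lower memory level is modelled as a plain callable, which is an assumption made only for illustration.

```python
OFFSET_BITS = 2   # address bits 1..0
INDEX_BITS = 10   # address bits 11..2, selecting one of 1024 lines
# The remaining 20 bits (31..12) form the Tag.


def split_address(addr):
    """Split a 32-bit address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        self.lines = [{"valid": False, "tag": 0, "data": None}
                      for _ in range(1 << INDEX_BITS)]

    def read(self, addr, lower_level):
        """Return (hit, data); on a miss, fetch from lower_level and fill the line."""
        tag, index, _ = split_address(addr)
        line = self.lines[index]
        if line["valid"] and line["tag"] == tag:
            return True, line["data"]          # hit: valid bit set and tags match
        data = lower_level(addr)               # miss: go to the lower level
        line.update(valid=True, tag=tag, data=data)
        return False, data


cache = DirectMappedCache()
memory = lambda addr: addr ^ 0xFFFF            # stand-in for the lower level
print(cache.read(0x12345678, memory)[0])       # first access misses
print(cache.read(0x12345678, memory)[0])       # second access hits
```

Two addresses that share an Index but differ in Tag evict each other in this organization, which is the conflict behaviour that associativity is meant to reduce.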

The mapping in the example above is called direct mapped because each memory block address maps to exactly one location in the cache. The opposite extreme is fully associative mapping, in which a memory block can be placed in any location in the cache. To find a block in a fully associative cache, all entries must be searched, and the hardware cost increases significantly because more parallel comparators are needed. The scheme between direct mapped and fully associative is called set associative. Fig.2. 3 shows different associativity structures for a four-block cache.

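[Figure: the four-block cache drawn as 1-way set associative (direct mapped), 2-way set associative, and 4-way set associative (fully associative); every block holds a Tag and Data.]

Fig.2. 3 A four-block cache configured as direct mapped, two-way set associative, and fully associative.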
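A small sketch makes the three organizations of the four-block cache concrete. It assumes the usual placement rule, set index = block address modulo number of sets, which the text does not state explicitly; the function name is hypothetical.

```python
def candidate_blocks(block_address, ways, num_blocks=4):
    """Return the cache block positions where a memory block may be placed.

    ways=1 is direct mapped, ways=num_blocks is fully associative.
    Assumes set index = block_address mod number_of_sets.
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets
    first = set_index * ways
    return list(range(first, first + ways))


for ways, name in [(1, "direct mapped"), (2, "2-way"), (4, "fully associative")]:
    print(name, candidate_blocks(6, ways))
```

The length of the returned list equals the number of tags that must be compared in parallel for one lookup, which is why hardware cost grows with associativity.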

2.2.2 Reconfigurable Cache Techniques and Improvements

The best cache configuration for a system depends on application characteristics and design constraints [2.2]. Since no single cache organization can fulfill the requirements of all applications [2.3], one way to overcome this problem is to make the cache reconfigurable. Reconfigurable caches need additional mechanisms that allow the on-chip SRAM cache to be dynamically partitioned and reused by other processor elements. Cache organizations can be categorized by partitioning method, data consistency process, reconfiguration policy, and reconfigurable cache level [2.4]. The following subsections introduce the basic concepts of these organizations and previous work on adaptive caches.

2.2.2.1 Cache Partition methods

To resize the cache, the partitioning of the SRAM storage is a key challenge in designing a reconfigurable cache. Several partitioning methods are described below.

Associativity-based partitioning

Fig.2. 4 Associativity-based partitioning organization for reconfigurable caches

Associativity-based partitioning divides the reconfigurable cache into partitions at the granularity of the ways of a traditional cache [2.4]. Fig.2. 4 shows an example and compares it with a conventional set-associative cache. This partitioning approach has several advantages. First, it requires only a few changes to the conventional set-associative cache organization. Second, requests addressed to different partitions are isolated from each other. The drawback is that the number and granularity of the partitions are limited by the associativity of the cache.

Albonesi [2.5] proposed selective cache ways for on-demand cache resource allocation. The technique disables a subset of the ways in a set-associative cache to lower energy consumption. Parthasarathy [2.4] presented reconfigurable caches for media processing applications using associativity-based partitioning. In contrast to simply turning off some partitions as in [2.5], that work suggests using the partitions for alternative processor activities to enhance performance. Zhang [2.6] proposed a highly configurable cache architecture for embedded systems, which is also based on associativity-based partitioning. The cache uses a way concatenation technique so that software can configure it as direct mapped, two-way, or four-way set associative.
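Way-granularity partitioning can be sketched as a set-associative cache in which every requester is restricted to a list of ways. This is a minimal illustration and not the design of [2.4], [2.5], or [2.6]: the class name, the round-robin replacement, and the "video" and "wireless" partitions are all assumptions. A way that appears in no list corresponds to a disabled way in the selective-ways sense.

```python
class WayPartitionedCache:
    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.next_victim = {}   # (ways, set index) -> position of next way to fill

    def access(self, block_addr, ways):
        """Look up block_addr using only the given ways. Returns True on a hit."""
        set_idx = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        row = self.tags[set_idx]
        if any(row[w] == tag for w in ways):
            return True
        # Miss: fill round-robin, but only among the ways of this partition.
        key = (tuple(ways), set_idx)
        pos = self.next_victim.get(key, 0)
        row[ways[pos]] = tag
        self.next_victim[key] = (pos + 1) % len(ways)
        return False


cache = WayPartitionedCache(num_sets=4, num_ways=4)
video, wireless = [0, 1, 2], [3]      # hypothetical partitions
cache.access(0, video)                # video fills way 0 of set 0
cache.access(4, wireless)             # wireless may only touch way 3
cache.access(8, wireless)             # evicts wireless's own block
print(cache.access(0, video))         # video's block is still present
```

Because replacement never leaves the partition, traffic from one requester cannot evict another requester's data, and the number of partitions can never exceed the number of ways.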

Overlapped wide-tag partitioning

Another partitioning method is overlapped wide-tag partitioning [2.4]. The parts that differ from a conventional cache are indicated by the dark-shaded regions in Fig.2. 5. This method widens the tag array so that it can hold the largest tag required by any of the supported partition sizes. With this organization a partition can in principle have any size, but the size is generally limited to powers of two for simpler implementation. The main drawback is that the data in all blocks must be flushed when resizing occurs, because the address mapping changes.

Fig.2. 5 Overlapped wide-tag partitioning organization for reconfigurable caches
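The two properties of wide-tag partitioning, a tag array sized for the smallest partition and a mapping that changes on resizing, can be checked numerically. The sketch assumes a direct-mapped partition, 32-bit addresses, and 32-byte lines; these parameters and all function names are illustrative assumptions, not values from [2.4].

```python
ADDRESS_BITS = 32
LINE_BYTES = 32        # assumed line size
KB = 1024


def log2_exact(n):
    assert n > 0 and n & (n - 1) == 0, "power of two expected"
    return n.bit_length() - 1


def tag_bits(partition_bytes):
    """Tag width needed by a direct-mapped partition of the given size."""
    index_bits = log2_exact(partition_bytes // LINE_BYTES)
    return ADDRESS_BITS - index_bits - log2_exact(LINE_BYTES)


def wide_tag_bits(partition_sizes):
    """The tag array must hold the widest tag, i.e. that of the smallest partition."""
    return max(tag_bits(size) for size in partition_sizes)


def line_index(addr, partition_bytes):
    return (addr // LINE_BYTES) % (partition_bytes // LINE_BYTES)


sizes = [8 * KB, 16 * KB, 32 * KB, 64 * KB]
print(tag_bits(64 * KB), wide_tag_bits(sizes))
print(line_index(0x12345678, 64 * KB), line_index(0x12345678, 8 * KB))
```

The same address selects a different line after resizing, so lines filled under the old mapping would be looked up in the wrong place; this is why the text requires a flush.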

… changed, and the cache partitioning method used for resizing is similar to overlapped wide-tag partitioning. After this work, a hybrid selective-sets-and-ways cache organization was proposed [2.8] to enhance configuration flexibility. Fig.2. 6 and Fig.2. 7 show the basic structures of selective-ways and selective-sets resizable caches, respectively. In addition, Ravi Iyer [2.9] proposed CQoS, a work on heterogeneous cache regions, in which the set partitioning technique is applied in the organization schemes.

Fig.2. 6 A selective-ways organization.

Fig.2. 7 A selective-sets organization.

Molecular-based partitioning [2.10]

In many partitioning works, the cache SRAM is divided into several individual sub-caches. We categorize these methods as molecular-based partitioning. The separate caches can be reorganized dynamically according to application requirements. Varadarajan presented Molecular Caches, which are composed of many small, reconfigurable building blocks called molecules [2.10]. The design can dynamically adjust the cache capacity, set associativity, and line size. The cache accessed by a processor is an aggregation of molecules. Molecular Caches support selective enabling of molecules according to application requirements so that dynamic power dissipation can be reduced. The physical organization of molecules is shown in Fig.2. 8, where 'M' denotes a molecule. Four to eight tiles are grouped into a tile cluster, and every cluster is associated with a tile controller named Ulmo, which handles the coherence traffic and tile misses between clusters. Fig.2. 9 shows the cache access method. Each molecule is configured with an Application Space Identifier (ASID), which uniquely identifies a running application. Before any cache operation is performed on a molecule, an ASID match determines whether the molecule is eligible to perform the operation.

Fig.2. 8 Tiles - A physical organization of molecules.

Fig.2. 9 Different steps in cache access in the molecular cache
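The ASID check that gates every molecule access can be sketched as follows. This is only an illustration of the eligibility test described above: the `Molecule` class, the dictionary used as storage, and the function name are hypothetical and do not reproduce the structure of [2.10].

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    asid: int                                    # application that owns this molecule
    lines: dict = field(default_factory=dict)    # address -> data (simplified storage)


def molecular_read(molecules, asid, addr):
    """Return (hit, data); only molecules whose ASID matches are searched."""
    for molecule in molecules:
        if molecule.asid != asid:
            continue                             # ASID mismatch: molecule not eligible
        if addr in molecule.lines:
            return True, molecule.lines[addr]
    return False, None


molecules = [
    Molecule(asid=1, lines={0x40: "a"}),
    Molecule(asid=2, lines={0x40: "b"}),
    Molecule(asid=1, lines={0x80: "c"}),
]
print(molecular_read(molecules, 1, 0x80))
print(molecular_read(molecules, 2, 0x80))
```

The cache seen by application 1 is the aggregation of the first and third molecules, while application 2 sees only the second, even though all three sit in the same physical array.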

… architectures. A typical allocation in their design is shown in Fig.2. 10. According to the different memory resource requirements, the L2 cache banks are divided into eight parts for eight cores.

Fig.2. 10 An example of typical CMP cache partitioning

The sub-caches can be heterogeneous. The CQoS work presented by Ravi Iyer [2.9] uses heterogeneous caches in its platforms. In addition, Benitez [2.2] presents the Amorphous Cache (AC), a reconfigurable on-chip L2 cache organized from heterogeneous sub-caches. Fig.2. 11 shows the AC structure, whose maximum cache size is 2 MB. There are six sub-caches, with sizes ranging from 64 KB to 1 MB. The AC uses configuration registers to organize the cache into different sizes and associativities. It has eighteen configurations because the cache size can range from 64 KB to 2 MB and the associativity can be 4, 8, or 16 ways.
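The count of eighteen can be reproduced by enumerating size and associativity pairs. The sketch assumes that the selectable cache sizes are the powers of two between 64 KB and 2 MB; the text gives only the range, so this is one reading that is consistent with the stated total rather than a confirmed detail of [2.2].

```python
from itertools import product

KB = 1024

# Assumption: selectable sizes double from 64 KB up to 2 MB (six values).
sizes = [(64 * KB) << i for i in range(6)]
associativities = (4, 8, 16)

configurations = list(product(sizes, associativities))
print(len(configurations))
for size, ways in configurations[:3]:
    print(f"{size // KB} KB, {ways}-way")
```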


Fig.2. 11 Basic structure of the reconfigurable Amorphous Cache for processors with large on-chip cache memories

2.2.2.2 Data Consistency

Another problem to overcome is data consistency after the cache is resized. Reconfigurable caches need a mechanism to ensure that data belonging to a particular processor activity resides only in the partition associated with that activity [2.4]. There are generally two approaches to data consistency: cache scrubbing and lazy transitioning. Both are introduced briefly below.

Cache scrubbing

Cache scrubbing moves all valid data to the new partitions or to lower levels of memory when reconfiguration happens. At reconfiguration time, this approach examines every location of the cache for validity and performs suitable actions on the valid data [2.4]. Cache scrubbing incurs a large overhead because of the large number of data accesses, but it is acceptable when reconfiguration is infrequent.
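A scrubbing pass can be sketched as a single walk over every cache line. The line representation, the callback names, and the rule that data without a new home is evicted are assumptions made for this illustration; [2.4] is not quoted for these details.

```python
def scrub(lines, new_partition_of, write_back):
    """Visit every cache line once at reconfiguration time.

    lines: list of dicts with keys 'valid', 'dirty', 'addr', 'partition'.
    new_partition_of(addr): partition that should hold addr after resizing,
        or None if the data no longer has a place in the cache.
    write_back(addr): called for dirty data pushed to the lower level.
    Returns (moved, evicted, examined).
    """
    moved = evicted = examined = 0
    for line in lines:
        examined += 1                      # every location is checked for validity
        if not line["valid"]:
            continue
        target = new_partition_of(line["addr"])
        if target is None:
            if line["dirty"]:
                write_back(line["addr"])   # preserve modified data in lower memory
            line["valid"] = False
            evicted += 1
        elif target != line["partition"]:
            line["partition"] = target     # move into the newly assigned partition
            moved += 1
    return moved, evicted, examined


lines = [
    {"valid": True, "dirty": True, "addr": 0, "partition": "A"},
    {"valid": True, "dirty": False, "addr": 1, "partition": "A"},
    {"valid": False, "dirty": False, "addr": 2, "partition": "B"},
    {"valid": True, "dirty": False, "addr": 3, "partition": "B"},
]
written = []
print(scrub(lines, {1: "B", 3: "B"}.get, written.append), written)
```

The `examined` counter always equals the cache size, which is the source of the overhead noted above: the cost is paid in full on every reconfiguration regardless of how little data actually has to move.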

Lazy transitioning

When reconfiguration happens frequently, a more suitable scheme is to move data lazily into its correct partition only when it is accessed. This requires additional information in each cache line to identify the user of that line, so that accesses addressed to the line can be checked. Note that if a miss occurs in the appropriate partition, the other partitions must also be checked because the data may still reside in one of them. This method avoids the high overhead of moving large amounts of data at reconfiguration time, but it needs more state storage and may increase contention for the other SRAM partitions.
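The lazy scheme can be sketched with per-line owner information and a fallback search of the other partitions. The class name, the dictionary-based partitions, and the three result strings are hypothetical choices for this illustration.

```python
class LazyCache:
    def __init__(self, partitions):
        # partition name -> {address: owner}; the owner is the extra per-line state
        self.store = {name: {} for name in partitions}

    def access(self, addr, owner, home):
        """Access addr on behalf of owner, whose current partition is home.

        Returns 'hit', 'moved' (found in another partition and migrated),
        or 'miss' (allocated in the home partition).
        """
        if self.store[home].get(addr) == owner:
            return "hit"
        # Miss in the appropriate partition: the data may still sit elsewhere.
        for name, part in self.store.items():
            if name != home and part.get(addr) == owner:
                del part[addr]
                self.store[home][addr] = owner   # migrate only now, on access
                return "moved"
        self.store[home][addr] = owner
        return "miss"


cache = LazyCache(["A", "B"])
print(cache.access(0x10, "video", "A"))   # first touch allocates in A
print(cache.access(0x10, "video", "B"))   # after reconfiguration, video's home is B
print(cache.access(0x10, "video", "B"))   # now resident in the correct partition
```

Nothing is moved at reconfiguration time; the cost shows up instead as the extra search on the first access after the change, which is the contention on other partitions that the text mentions.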

2.2.2.3 Reconfiguration Policy and Detection

A reconfigurable cache needs a detection mechanism and a reconfiguration policy to determine when to reconfigure. The reconfiguration strategy can be static or dynamic. With a static strategy, the cache is resized before the application executes. A dynamic strategy instead reconfigures the cache organization while the application is running. It needs a detection mechanism that dynamically monitors performance and energy dissipation to determine when to reconfigure and which organization to choose. The mechanism can be software or hardware controlled.

The reconfiguration policy and detection mechanism depend on the organization of the reconfigurable cache. Albonesi [2.5] used a software-visible register, called the Cache Way Select Register (CWSR), to enable or disable particular ways. The CWSR is written and read by specific pre-defined instructions. The Performance Degradation Threshold (PDT) specifies the performance degradation allowed relative to a cache with all ways enabled, and a suitable way organization is selected for the cache accordingly. Kaseridis [2.11] used Mattson's stack distance algorithm and the concept of marginal utility, which originated in economic theory, as the assignment policy in bank-aware cache partitioning. Benitez [2.2] proposed a Basic Block Vector (BBV)-based tuning technique that traces the loop characteristics of the program at runtime and dynamically learns the configuration type by holding the previous CPI value.
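The threshold-based choice of enabled ways described for [2.5] can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name `select_ways`, the dictionary of measured runtimes, and the sample numbers in the usage note are hypothetical, and the actual procedure in [2.5] is not reproduced here.

```python
def select_ways(runtime_by_ways, pdt):
    """Return the smallest number of enabled ways whose slowdown stays within pdt.

    runtime_by_ways: dict {enabled_ways: measured runtime}; the largest key is
                     the baseline with all ways enabled.
    pdt:             allowed fractional slowdown, e.g. 0.04 for 4 percent.
    """
    full = max(runtime_by_ways)
    baseline = runtime_by_ways[full]
    for ways in sorted(runtime_by_ways):
        degradation = (runtime_by_ways[ways] - baseline) / baseline
        if degradation <= pdt:
            return ways       # fewest ways that still meet the threshold
    return full
```

For example, with hypothetical runtimes `{1: 130, 2: 108, 3: 103, 4: 100}` and a threshold of 0.04, three ways are enabled, because two ways would cost an 8 percent slowdown.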

Related work on reconfigurable caches is summarized in Table 2.2.

Work | Partitioning mechanism | Data consistency | Detection mechanism | Reconfigurable cache level | Application
[2.2] | Molecular-based | Cache scrubbing | Hardware controlled; dynamic strategy | L2 | General purpose
[2.4] | Associativity-based | Cache scrubbing | Software controlled | L1 | Media processing
[2.5] | Associativity-based | Lazy transitioning | Software controlled; dynamic strategy | L1 | General purpose
[2.6] | Associativity-based | N/A | Software controlled; static strategy | L1 | Embedded system
[2.7] | Overlapped wide-tag | Cache scrubbing | Software controlled; static/dynamic strategy | L1 I-cache | General purpose
[2.8] | Hybrid | Cache scrubbing | Software controlled; static/dynamic strategy | L1 | General purpose
[2.9] | Overlapped wide-tag, molecular-based | Cache scrubbing | Software controlled; dynamic strategy | Shared cache | Multi-core network-intensive applications
[2.10] | Molecular-based | Cache scrubbing | Software controlled; dynamic strategy | L2 | General purpose multi-core
[2.11] | Molecular-based | N/A | Software controlled; dynamic strategy | L2 | General purpose multi-core


2.3 DRAM

2.3.1 DRAM characteristic

Dynamic random-access memory (DRAM) has been widely used to provide additional off-chip storage capacity. Compared with SRAM, a DRAM cell is "dynamic" because the capacitors storing electrons are not perfect devices, and their eventual leakage requires that each capacitor in the DRAM be periodically refreshed to retain the information stored there [2.1]. However, the cost per bit of DRAM is much lower than that of SRAM. In the memory hierarchy, DRAM is the level below the on-chip SRAM (cache).

2.3.1.1 Basic DRAM architecture

A DRAM is usually composed of memory arrays, address decoders, row buffers, a mode register, and a data buffer. Fig. 2.12 shows a simplified block diagram. In this example, four banks share the address bus and command bus. Each bank has its own row decoder, column decoder, and sense amplifiers. The mode register stores the DRAM operation mode, including the burst length (BL), the column address strobe latency (CL), and the burst type. Users can set the value of the mode register through the address bus with the proper command.

Fig. 2.12 Simplified architecture of a DRAM (blocks: Bank 0, sense amplifiers and row buffer, row decoder, column decoder, mode register, data buffer, address input).

2.3.1.2 DRAM command and operation

The common DRAM commands and their operations are introduced as follows.

NO OPERATION (NOP):

The NOP command can prevent unwanted commands from being registered during idle or wait states. Operations already in progress are not affected.

ACTIVE:

This command is used to open a row in a particular bank. The row remains open for accesses until a PRECHARGE command is issued to that bank.

READ/WRITE:

The READ/WRITE command is used to initiate a read/write access to an active row. If auto precharge is selected, the row being accessed is closed at the end of the access.

PRECHARGE:

The PRECHARGE command is used to deactivate the open row in a particular bank. The bank becomes available for a subsequent row access a specified time (tRP) after the PRECHARGE command is issued.

REFRESH:

The refresh command can be used to retain data in the DRAM.

A memory access, whose simplified state diagram is depicted in Fig. 2.13, consists of three operations: row activation (ACTIVE), column access (READ/WRITE), and precharge (PRECHARGE).

Fig. 2.13 Bank state diagram (labels: IDLE, ACTIVE, ROW ACTIVE, COLUMN ACCESS, PRECHARGE).

The ACTIVE command opens a particular row in one of the banks and copies the row data into the row buffer. This operation takes a latency period called tRCD. After the tRCD delay, a column access command (READ/WRITE) can be issued to access a single data word or sequential data, according to the burst length and burst type set in the mode register. During the tRCD time, no other commands can be issued to the same bank. However, commands to other banks are permitted because the banks operate in parallel. For a read operation, valid output data from the starting column address becomes available one CAS latency after the READ command, as shown in Fig. 2.14. For a write command in DDR3 SDRAM, the write data is sent to the DRAM after a write latency has elapsed; the timing diagram is shown in Fig. 2.15. Finally, a PRECHARGE command must be issued before a different row in the same bank can be opened.

Fig. 2.14 DDR3 Read command [2.12].

Fig. 2.15 DDR3 Write command [2.12].
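The command sequence described above (ACTIVE, then READ/WRITE, then PRECHARGE before a different row is opened) can be sketched as a small per-bank model. The class names and the cycle counts in `Timing` are assumptions chosen for illustration; they are not taken from [2.12] or from any datasheet, and refresh and write latency are left out.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    # Hypothetical cycle counts for illustration only.
    tRCD: int = 5  # ACTIVE to READ/WRITE delay
    CL: int = 5    # CAS latency: READ to first data
    tRP: int = 5   # PRECHARGE to next ACTIVE delay


class Bank:
    def __init__(self, timing):
        self.t = timing
        self.open_row = None  # None means the bank is idle (precharged)

    def read(self, row):
        """Return (commands issued, cycles until first data) for a read to `row`."""
        cmds, cycles = [], 0
        if self.open_row is not None and self.open_row != row:
            cmds.append("PRECHARGE")   # close the conflicting row first
            cycles += self.t.tRP
            self.open_row = None
        if self.open_row is None:
            cmds.append("ACTIVE")      # copy the row into the row buffer
            cycles += self.t.tRCD
            self.open_row = row
        cmds.append("READ")
        cycles += self.t.CL
        return cmds, cycles
```

With these assumed timings, a read to an idle bank takes 10 cycles, a read to the already open row takes 5, and a read to a different row of the same bank takes 15, which is why the row-hit rate matters in the controller techniques that follow.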

2.3.2 DRAM controller techniques and Improvements

Depending on the application or system, memory controllers can be categorized into two classes: special-purpose and general-purpose memory controllers. A special-purpose memory controller serves one specific kind of application in order to reduce the memory access latency. In many multimedia applications, advanced video processing needs a huge data storage space. To support a real-time video environment, the system needs external memory to store image frame data or motion information. However, the memory access speed is much slower than the execution speed of the processing unit. Many researchers have shown that a good memory management method that exploits the regular memory access behavior of video processing can significantly improve the overall system performance.

Several approaches have been proposed to increase the efficiency of memory access for specific video coding applications. Kim's memory interface architecture [2.13] reorganizes the data arrangement in synchronous DRAM to increase the row-hit rate. Park proposed a memory mode control approach [2.14] for an HDTV video decoder. It uses history-based prediction to predict whether the next command will be a row hit or a row miss. If the next command is predicted to be a row miss, the controller precharges the current bank. If it is predicted to be a row hit, the current row stays in the active state. The prediction is implemented by a finite state machine, which is shown in Fig. 2.16.

Fig. 2.16 State machine for storing page hit history information.
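A history-based row-hit predictor of this kind can be sketched with a small state machine. The exact states and transitions of [2.14] are not given in the text, so the 2-bit saturating counter below is a swapped-in, commonly used stand-in, and the names `RowHitPredictor` and `policy_after_access` are hypothetical.

```python
class RowHitPredictor:
    """2-bit saturating counter: states 0-1 predict a row miss, 2-3 a row hit."""

    def __init__(self, state=2):
        self.state = state

    def predict_hit(self):
        return self.state >= 2

    def update(self, was_hit):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if was_hit:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def policy_after_access(predictor):
    # Predicted miss: precharge early to hide tRP. Predicted hit: keep the row open.
    return "keep row open" if predictor.predict_hit() else "precharge"
```

The saturating counter tolerates a single mispredicted access before it changes its decision, which suits the regular access patterns of video decoding.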

Chang proposed a two-layer external memory management unit [2.15] for an H.264/AVC decoder. The first layer is the address translation (AT), which provides an efficient pixel data arrangement to reduce the occurrence of row misses. The second layer is the external memory interface (EMI). In the address translation layer, the address translation machine uses a novel data arrangement suited to the H.264/AVC decoder to increase the memory bandwidth and reduce the power consumption. To minimize the number of ACTIVE and PRECHARGE commands, a chessboard-based memory mapping is presented, as shown in Fig. 2.17. In addition, luma and chroma data are interleaved: the interlaced memory mapping method puts the luminance block and the chrominance block in the same row of the bank. Because the decoder accesses a chrominance block after each luminance block, it does not need to re-activate the row when accessing the chrominance block, which reduces both latency and power consumption. To decrease the latency of the row-miss and bank-miss cases, the physical addresses produced by the AT are stored in a specific command FIFO, which can automatically detect whether a row miss or bank miss will happen. The architecture of the command FIFO is shown in Fig. 2.18. The incoming address is compared with PAR. If the bank address and row address are the same as PAR, the hit bit of the previous command is set to one, which turns off the auto-precharge capability. Otherwise, the hit bit remains zero so that auto-precharge is turned on to reduce the latency of a row miss.

Fig. 2.17 Interlaced method

Fig. 2.18 Two architectures of the command FIFO. B equal to one means a bank hit; R equal to one means a row hit.
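The hit-bit mechanism of the command FIFO can be sketched as follows. This is a minimal illustration under assumptions, not the design of [2.15]: the function name `tag_commands` and the dictionary fields are hypothetical, and the comparison is assumed to be on the bank and row addresses, since those determine whether the open row can be reused.

```python
def tag_commands(addresses):
    """Tag each queued command with a hit bit and an auto-precharge decision.

    addresses: list of (bank, row, col) tuples in arrival order.
    The hit bit of a command is set to 1 when the NEXT command targets the same
    bank and row, so the row can stay open (auto-precharge off).
    """
    fifo = []
    prev = None  # plays the role of the register holding the previous address
    for bank, row, col in addresses:
        if prev is not None and (bank, row) == prev:
            fifo[-1]["hit"] = 1
        fifo.append({"bank": bank, "row": row, "col": col, "hit": 0})
        prev = (bank, row)
    for cmd in fifo:
        cmd["auto_precharge"] = cmd["hit"] == 0
    return fifo
```

Only the first of two back-to-back accesses to the same row keeps the row open; every command whose successor goes elsewhere closes its row automatically, so the following ACTIVE does not wait for a separate PRECHARGE.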

Park [2.14] proposed a history-based memory mode controller, while Zhu [2.16] and Hongqi [2.17] adjust the page size. These designs all try to reduce the total number of row misses and minimize the DRAM access latency. In an advanced memory controller, rearranging data is necessary to reduce the access latency. In addition, the advanced video coding standard H.264/AVC provides several new coding tools, including sub-pixel inter-prediction and variable block size motion compensation. Although these techniques reduce the bit rate and improve the video quality, they require a huge memory bandwidth to fetch additional reference pixels for motion compensation (MC) and interpolation. Fortunately, designers can use data reuse schemes to reduce the bandwidth needed to load sub-pixel MC data from DRAM. An interpolation window reuse (IWR) scheme [2.18] was proposed to reduce accesses to overlapped data. Li [2.19] proposed a cache-based architecture to reuse intra-MB overlapped data, and Chuang [2.20] proposed an IWR-like N-way associative cache architecture to reuse intra-MB and inter-MB overlapped data.

To improve the bandwidth, Kang [2.21] and Heithecker [2.22] proposed multi-channel memory controllers. The multi-channel concept can also be applied to general-purpose memory controllers. In SoC design, a variety of processor elements are integrated into one chip. Because different applications have different memory needs, it is difficult to find a single topology that fits all applications well. To accommodate a variety of functions, flexible and adaptable memory control is becoming more and more important in SoC systems. Furthermore, multi-core systems need a multi-channel memory controller to support high bandwidth and to meet the memory requirements of different applications. Many kinds of efficient memory systems have been developed. Lee [2.23] presents a multilayer, quality-aware memory controller to satisfy different memory access requirements. Fig. 2.19 shows the configurations of the different layers of the proposed memory controller. Layer 0, the memory interface socket (MIS), is a configurable, programmable, and highly efficient SDRAM controller that lets designers rapidly integrate an SDRAM subsystem into their designs. Layer 1, the quality-aware scheduler (QAS), provides quality-of-service guarantees, including minimum access latencies and fine-grained bandwidth allocation, for heterogeneous processor elements in SoC designs. Layer 2, the built-in address generator (BAG), is designed for multimedia processor elements; it can effectively reduce the address bus traffic and therefore further increase the efficiency of on-chip communication.

Fig. 2.19 Configurations of different layers of the proposed memory controller

Nikolov [2.24] presents an efficient multiprocessor platform that separates the data communication path from the memory data access path. Soininen [2.25] presents a smart memory tile architecture to improve memory bandwidth and performance. Ipek [2.26] proposed a self-optimizing memory controller based on reinforcement learning. To adjust the memory access scheduling dynamically, Zheng [2.27] proposed the ME-LREQ (Memory Efficiency-Least Request) policy.

In addition, many SoC and computer systems require DRAM devices to store data. Because of their 3-D (bank, row, column) structure, modern DRAM devices have non-uniform access latencies [2.28]. Consecutive memory accesses directed to the same row of the same bank have a lower access latency than accesses directed to different rows of the same bank, because no row conflict occurs. Many researchers have demonstrated that reordering memory requests and executing them out of order can significantly reduce the row conflict rate and improve the memory bandwidth efficiency. Shao [2.28] proposed a burst scheduling mechanism to maximize the bus utilization of the SDRAM device. With this scheduling, memory accesses to the same row of the same bank are clustered into bursts. Subsequently, Hu [2.29] proposed new memory access scheduling algorithms that overcome the starvation problem in burst scheduling.
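The effect of clustering accesses to the same row of the same bank can be illustrated with a short sketch. It is not the burst scheduling algorithm of [2.28]: the function names are hypothetical, and ordering across bursts, read/write priorities, and starvation handling are deliberately left out.

```python
from collections import OrderedDict


def cluster_into_bursts(requests):
    """Group requests that target the same (bank, row).

    requests: list of (bank, row, col) tuples in arrival order.
    Arrival order is preserved both between bursts and inside each burst.
    """
    bursts = OrderedDict()
    for bank, row, col in requests:
        bursts.setdefault((bank, row), []).append((bank, row, col))
    return list(bursts.values())


def count_row_switches(order):
    """Count how often consecutive accesses to one bank change the open row."""
    open_rows, switches = {}, 0
    for bank, row, _ in order:
        if bank in open_rows and open_rows[bank] != row:
            switches += 1
        open_rows[bank] = row
    return switches
```

For the arrival order `[(0, 1, 0), (0, 2, 0), (0, 1, 4), (0, 2, 4)]`, the original order switches rows three times, while the clustered order switches only once. A request whose row is never the one being served could wait indefinitely under such clustering, which is the starvation problem mentioned above.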

2.3.3 Modern DRAM Development

DRAM technology has developed very fast. Fig. 2.20 shows the roadmap of the DDR SDRAM family from 2001 to 2008. The bandwidth increased significantly during these years. The important issues for DRAM are bandwidth, latency, and power. This section introduces the developments that have improved DRAM performance and the future trend of DRAM.

Fig. 2.20 DRAM roadmap

2.3.3.1 Bandwidth

The improvement in DRAM bandwidth has never kept up with increasingly complicated applications such as multimedia and 3D processing. To meet the demand for high bandwidth, DRAM manufacturers have announced various new DRAM specifications. The SDRAM standards supported by JEDEC [2.30] have become the mainstream of the DRAM market. Several techniques have been applied in the latest standards announced by JEDEC to provide users with higher bandwidth.

Component | I/O bus clock (MHz) | Data transfer rate (MT/s) | Peak transfer rate (MB/s)
SDR | 133 | 133 | 532
DDR | 200 | 400 | 3200
DDR2 | 533 | 1066 | 8533
DDR3 | 800 | 1600 | 12800

Table 2.3 The maximum transfer rate for SDR, DDR, DDR2 and DDR3

Table 2.3 shows the maximum data transfer rates of SDR, DDR, DDR2, and DDR3 components. In SDR, the data transfer rate is equal to the I/O bus clock frequency, and data is transferred on the positive edge of the clock. In the DDRx standards, data is transferred on both the positive and negative edges of the clock, so the data rate is twice the I/O bus clock frequency. In addition, the prefetch technique enables DRAM to provide four times the bandwidth of SDR while the core frequency remains unchanged.
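The relation between the columns of the transfer rate table can be checked with a few lines of arithmetic. The bus widths used below are not stated in the text; they are inferred from the table values as an assumption (the DDR to DDR3 entries match an 8-byte data bus, while the SDR entry matches a 4-byte bus), and the function names are hypothetical.

```python
def data_rate_mt_s(io_clock_mhz, transfers_per_clock):
    """SDR moves 1 transfer per clock; DDRx moves 2 (both clock edges)."""
    return io_clock_mhz * transfers_per_clock


def peak_transfer_rate_mb_s(rate_mt_s, bus_width_bytes):
    """Peak rate = transfers per second x bytes moved per transfer."""
    return rate_mt_s * bus_width_bytes


# DDR3 row: 800 MHz I/O clock, double data rate, assumed 8-byte bus.
ddr3_rate = data_rate_mt_s(800, 2)                   # 1600 MT/s
ddr3_peak = peak_transfer_rate_mb_s(ddr3_rate, 8)    # 12800 MB/s
```

The DDR2 entry of 8533 MB/s follows from the unrounded 533.33 MHz clock; using the rounded 533 MHz gives 8528 MB/s.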

2.3.3.2 Latency

The DRAM response latency directly influences the speed of the whole system. For multimedia processing, system speed is essential to meet real-time requirements, so a shorter DRAM latency boosts the performance of the whole system. However, the situation is not as expected. Fig. 2.21 compares the performance trends of the CPU and DRAM. While the CPU clock speed increased 7.65 times, the DRAM latency also increased 4.6 times. CPU performance has improved much faster than DRAM performance. With a long response latency, the CPU wastes its processing power on waiting and the performance is limited.

Fig. 2.21 CPU vs. DRAM performance (trends over the years: CPU clock speed 7.65x, DRAM peak bandwidth 3.32x, DRAM latency 4.6x)

One way to reduce the access latency is to execute accesses addressed to different banks in parallel as much as possible. Successive accesses to the same bank incur more latency than successive accesses to different banks. The timing diagrams of successive accesses to the same bank and to different banks are shown in Fig. 2.22 and Fig. 2.23, respectively. If the number of banks increases, the fraction of accesses directed to different banks can increase. Table 2.4 shows the number of banks in the DDR family.

Standard | SDR | DDR | DDR2 | DDR3
Number of banks | 4 | 4 | 4, 8 | 8

Table 2.4 Number of banks for SDR, DDR, DDR2 and DDR3

Fig. 2.22 Accesses addressed to same bank (timing diagram; signals: system clock, bank address, row/column address, command bus, data bus; clock cycles 0 to 15)

Fig. 2.23 Accesses addressed to different bank (timing diagram; signals: system clock, bank address, row/column address, command bus, data bus; clock cycles 0 to 15)
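The difference between the two timing diagrams can be approximated with a simple cycle-count model. It is an idealized sketch under stated assumptions, not a reading of the figures: the function names and the timing values in the usage note are hypothetical, every access is assumed to need a precharge and an activation, and the different-bank case is a lower bound that assumes the row opening of later banks is completely hidden behind earlier data transfers.

```python
def same_bank_cycles(n, tRP, tRCD, CL, burst):
    """n row-conflicting accesses to one bank are fully serialized."""
    return n * (tRP + tRCD + CL + burst)


def different_bank_cycles(n, tRP, tRCD, CL, burst):
    """Lower bound for n accesses to n different banks.

    Only the first access exposes its row-opening cost; the other banks are
    precharged and activated while earlier data is still being transferred.
    """
    return (tRP + tRCD + CL) + n * burst
```

With assumed values tRP = tRCD = CL = 3 cycles and a burst of 2 cycles, two accesses take 22 cycles in the same bank but at least 13 cycles when they go to different banks.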

2.3.3.3 Power

In many portable wireless devices such as mobile phones and PDAs, power consumption is a significant issue because battery life is limited. As multimedia applications become popular, the required memory size grows. Accordingly, designers often select DRAM as the main memory component.

To reduce DRAM power, many low-power products have been introduced, such as BAT-RAM from Micron [2.31] and Mobile-RAM from Infineon [2.32]. Low-power DRAM has the following special features.

Low Operating voltage

Compared with SDR SDRAM, the operating voltage of low-power DRAM is lowered from 3.3 V to 1.8 V, so the power consumption decreases significantly. The supply voltages of the DDR family are shown in Table 2.5.

Standard | SDR | DDR | DDR2 | DDR3
Supply voltage | 3.3 V | 2.6 V | 1.8 V | 1.5 V

Table 2.5 Supply voltages for DDR family

Output Driver Strength

Because low-power DRAM is designed for smaller systems that typically use point-to-point connections, an option to control the drive strength of the output buffers is provided. The drive strength should be selected based on the expected loading of the memory bus. There are four allowable settings for the output drivers: full strength, half strength, quarter strength, and one-eighth strength.

Temperature Compensated Self Refresh (TCSR)

Mobile devices stay in standby mode most of the time, and the DRAM can enter self-refresh mode to save unnecessary power consumption. In self-refresh mode, the DRAM refreshes the data stored in its cells by itself. The required refresh period becomes shorter as the temperature rises, but traditional DRAM supports only a single refresh period, chosen for the worst-case condition. In low-power DRAM, a temperature sensor is implemented to automatically control the self-refresh oscillator on the device. Therefore, the refresh current decreases when the temperature is low.

Partial Array Self Refresh

For further power savings during self refresh, the partial array self refresh (PASR) feature enables the controller to select the amount of memory that is refreshed during self refresh.


One method of controlling power efficiency in applications is to throttle the clock that controls the SDRAM. There are two basic ways to control the clock:

1. Changing the clock frequency when the data transfers require a different rate.

2. Stopping the clock altogether.

Both methods are specific to the application and its requirements, and both save power because there are fewer transitions on the clock path.

The clock can also be stopped altogether if there are no data accesses in progress, either WRITE or READ, that would be affected by this change; i.e., if a WRITE or a READ is in progress, the entire data burst must be through the pipeline prior to stopping the clock.

For the full duration of the clock stop mode. One clock cycle and at least one NOP is required after the clock is restarted before a valid command can be issued.

It is recommended that the DRAM be in a pre-charged state if any changes to the clock frequency are expected. This will eliminate timing violations that may otherwise occur during normal operations.

Power-Down

Power-down can occur when all banks are idle; this mode is referred to as precharge power-down. If power-down occurs when a row is active in any bank, the mode is referred to as active power-down. Entering power-down mode deactivates all input and output buffers, which saves power.

Deep Power-Down

Deep power-down (DPD) is an operating mode used to achieve maximum power reduction by eliminating the power of the memory array. Data is not retained when the device enters deep power-down mode. Since DRAM is often used as a temporary data buffer, entering DPD mode while the device is in standby mode does not cause any loss.


Chapter 3 Memory-Centric On-chip Data Communication Platform for Wireless Video Entertainment Systems

In this chapter, a memory-centric on-chip data communication platform is developed for wireless video entertainment systems. First, the motivation for wireless video entertainment systems is presented in Section 3.1. Section 3.2 then describes the concept of the memory-centric on-chip data communication platform. The development of wireless video entertainment systems is introduced in Section 3.3. Finally, Section 3.4 describes how a wireless video entertainment system is constructed on the memory-centric on-chip data communication platform.

3.1 Motivations

Fig. 3.1 Wireless Video Entertainment Systems

With the advancement of wireless communication and multimedia techniques, various digital communication products have been developed. These modern electronic products provide a more convenient communication environment and richer media enjoyment than their predecessors. However, different applications and standards require a variety of devices. Fig. 3.1 illustrates a heterogeneous network environment in daily life. In recent years, merging different networks, electronic appliances, and media devices into a heterogeneous integrated platform has become an important issue, since it enables people to enjoy a more friendly and energy-efficient digital environment.

Fig. 3.2 Homogeneous multi-core platforms: (a) Intel Polaris; (b) Tilera TILEPro64™ processor

Fig. 3.3 Trend of the data transmitting bandwidth

To integrate various applications into one system, the multi-task/multi-core concept provides a typical solution. The design of multi-core platforms has recently been a popular research area [3.1]-[3.7]. Fig. 3.2 shows two homogeneous multi-core platforms: Intel proposed an 80-core platform, shown in Fig. 3.2(a) [3.1], and Tilera [3.2] proposed a 64-core platform, shown in Fig. 3.2(b). These multi-core platforms can execute billions of operations per second. Furthermore, the data transmission bandwidth of multi-core platforms is increasing year by year, as shown in Fig. 3.3. However, the overall system performance can be limited by task partitioning, task mapping, memory resource allocation, and memory data access. Fig. 3.4 indicates the bottlenecks of multi-core platforms, in which insufficient memory bandwidth and memory capacity prevent high communication efficiency. With the ongoing development of multi-core and multi-task systems, both more memory capacity and more memory access bandwidth are required. Enabling multiple simultaneous memory accesses is necessary for improving the memory bandwidth. However, increasing the number of memory read/write ports not only increases the hardware complexity but also reduces the memory performance and noise immunity. Conventional memory access methods cannot provide enough memory bandwidth for multi-core platforms. Hence, memory management in multi-core and multi-task platforms is becoming more and more important. It is essential to reduce additional memory accesses and to increase the memory bandwidth effectively. For these reasons, a memory-centric on-chip data communication platform is proposed and introduced in the following section.

Fig. 3.4 Comparison of memory bandwidth, memory capacity, and communication efficiency in multi-core systems

3.2 Memory-Centric On-chip Data Communication Platform

3.2.1 Overall Architecture

To solve the problems mentioned above, a hierarchical memory-centric on-chip data communication platform is proposed; its architecture is shown in Fig. 3.5. Heterogeneous processing elements such as microprocessors and application-specific stream processors can be integrated in the platform. In this platform, each processor element owns a distributed memory management unit (d-MMU). The d-MMU includes a local cache (D-cache and I-cache) and a cache controller, which efficiently handles all memory requests generated by the processor element. It can dynamically allocate unused cache space for buffering transmitted data. If a processor element requires additional memory resources, it can use the centralized memory resources, which include the centralized cache and the off-chip DRAM. These are controlled by a centralized memory management unit (c-MMU), which dynamically allocates and manages the memory resources according to the different memory requirements.

For data communication between processor elements, the platform applies a message-passing technique. The processor elements transmit data to and receive data from one another through an on-chip interconnection network. A network interface packetizes the data transmitted to the interconnection network and de-packetizes the data received from it. Furthermore, for better energy utilization in green computing, a power management unit can dynamically control the supply voltage and operating frequency of each processor element to save energy.

Fig. 3.5 The architecture of the memory-centric on-chip data communication platform (d-MMU: distributed memory management unit; NI: network interface; D-Cache: data cache; I-Cache: instruction cache)

In a heterogeneous multi-task platform, processor elements with different functions have quite different memory requirements. For instance, the memory requirement of video decoding is larger than that of the wireless processing unit. Moreover, system environment factors may affect the memory utilization of the applications at runtime: for example, wireless channels of different quality may lead to different memory behavior in an integrated wireless video system. Thus, an on-demand memory system with a multilevel memory hierarchy is applied in this platform. The memory system enables the processing elements to own different memory resources dynamically. The concept of the on-demand memory system is introduced in the following section.

3.2.2 Concepts of On-Demand Memory System

In the on-demand memory system, a three-level memory hierarchy is constructed, as illustrated in Fig. 3.6. At the first level of the hierarchy, the distributed memory management unit (d-MMU) controls the memory accesses. It includes the distributed cache and the cache controller for each processor element. Furthermore, to improve the transmission efficiency of data communication, the d-MMU can dynamically allocate unused space in the distributed cache to store packet data, which prevents stalls caused by data blocking. The detailed design of the d-MMU is described in Chapter 4.
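The idea of lending unused distributed-cache space to packet buffering can be sketched as simple bookkeeping. This is only an illustration under assumptions and not the d-MMU design of Chapter 4: the class `DistributedCache`, its line-granular accounting, and the method names are hypothetical.

```python
class DistributedCache:
    """Toy bookkeeping of cache lines shared between caching and packet buffering."""

    def __init__(self, total_lines):
        self.total_lines = total_lines
        self.cache_lines_in_use = 0  # lines holding valid cached data
        self.borrowed = 0            # lines currently lent out as packet buffer

    def unused(self):
        return self.total_lines - self.cache_lines_in_use - self.borrowed

    def borrow(self, lines_needed):
        """Lend up to `lines_needed` unused lines; return how many were granted."""
        granted = min(lines_needed, self.unused())
        self.borrowed += granted
        return granted

    def give_back(self, lines):
        """Return borrowed lines once the buffered packets have been consumed."""
        self.borrowed -= min(lines, self.borrowed)
```

When fewer lines are granted than requested, the remaining packet data would still have to wait, so the benefit depends on how much of the cache is idle at that moment.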

At the second level of the hierarchy, the centralized memory management unit (c-MMU) provides more memory resources for the processor elements. The c-MMU includes a cache controller and the centralized cache. The configuration of the centralized cache can be adjusted dynamically according to the memory requirements of the processor elements. For example, a processor element that requires more memory than the others can own more of the centralized memory resources. The adaptive cache control in the c-MMU governs this adaptive allocation and the cache operation. In addition, unused memories can be powered down to save memory power for green computing.
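Demand-driven sharing of the centralized cache, with unused parts powered down, can be illustrated by the sketch below. It is not the adaptive cache control of the c-MMU: the bank-granular allocation, the proportional scaling rule, the function name `allocate_banks`, and the processor element names in the usage note are all assumptions made for illustration.

```python
def allocate_banks(demands, total_banks):
    """Split the centralized cache banks among processor elements.

    demands: dict {pe_name: banks requested}.
    Requests are granted in full when they fit; otherwise they are scaled down
    proportionally, with leftover banks going to the largest remainders.
    Returns (grants, number of banks that can be powered down).
    """
    requested = sum(demands.values())
    if requested <= total_banks:
        return dict(demands), total_banks - requested
    shares = {pe: d * total_banks / requested for pe, d in demands.items()}
    grants = {pe: int(s) for pe, s in shares.items()}
    leftover = total_banks - sum(grants.values())
    by_remainder = sorted(shares, key=lambda pe: shares[pe] - grants[pe], reverse=True)
    for pe in by_remainder[:leftover]:
        grants[pe] += 1
    return grants, 0
```

For example, `allocate_banks({"svc": 4, "wpu": 1}, 8)` grants both requests and leaves three banks that could be powered down, while oversubscribed demands are scaled to fit the eight banks.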

To provide enough memory space, off-chip DRAM is applied as the third level of the memory hierarchy. A DRAM controller is needed to access the off-chip DRAM devices. It includes an external memory interface and an address translator to improve the memory access efficiency.


In the on-demand memory system, each processor element owns a private address space, which can be allocated dynamically. Data exchange between processor elements uses the message-passing mechanism, and the on-chip interconnection network in the platform is designed for this data communication. Note that this thesis focuses on the on-demand memory system; the design of the interconnection network is not included.

In conclusion, the memory management units achieve adaptive memory resource allocation and improve the memory utilization. The detailed organization and design of these memory management units are described in Chapter 4.

Fig. 3.6 Illustration of the memory hierarchy in the on-demand memory system (levels: distributed cache (L1 cache), dynamically configured by the d-MMUs and including the message passing buffer; centralized cache (L2 cache), dynamically configured by the c-MMU, with unused parts powered down; off-chip DRAM)

3.3 Wireless Video Entertainment Systems

With the ongoing advancement of digital and communication techniques, digital home services have become a trend. In daily life, the home is the personal headquarters for living and for keeping personal assets and information. If digital home services are applied, residents can effectively participate in events happening in local, national, and global communities without unnecessary travel. Digital home technique integrates wireless, wired physical transmission and multimedia


