
Chapter 1 Introduction

1.3 Organization

This thesis is organized as follows. Chapter 2 introduces related research on memory systems: the concept of the memory hierarchy, previous work on reconfigurable caches, DRAM architecture, basic DRAM operation, DRAM controllers, and modern DRAM development.

Chapter 3 then presents a memory-centric on-chip data communication platform with an on-demand memory system for wireless video entertainment applications. The development of wireless video entertainment systems and the concept of the on-demand memory system are introduced.

Chapter 4 presents the design of the Distributed and Centralized memory management units (MMUs) used in the memory-centric on-chip data communication platform. A buffer borrowing mechanism in the Distributed MMUs and an adaptive cache scheme in the Centralized MMU are proposed to dynamically optimize the utilization of memory resources in the on-demand memory system. An efficient external memory interface is presented for communication with external memory. In addition, Chapter 4 introduces the methods used to measure memory latency and energy.

Subsequently, pre-fetch and DRAM data allocation schemes are proposed in Chapter 5 to improve the memory energy efficiency of the Scalable Video Coding (SVC) functional block in wireless video entertainment systems. A pre-fetch command generator and an address translator are applied in the Distributed MMU and the Centralized MMU, respectively. With the proposed schemes, the memory energy consumed by the on-chip cache and the off-chip DRAM when the SVC function decodes video frames can be reduced significantly. Finally, Chapter 6 presents the conclusion and future work.

Chapter 2

Related Researches of Memory Systems

This chapter introduces related research on memory systems, including cache and DRAM systems, together with previous work on reconfigurable caches and DRAM controllers. Section 2.1 describes the concept of the memory hierarchy. Sections 2.2 and 2.3 then give an overview of cache and DRAM systems, respectively.

2.1 Memory hierarchy

Fig.2. 1 Memory hierarchy

In computer and SoC systems, memory elements are necessary for data storage. The most important design concept is the memory hierarchy, because a well-organized hierarchy lets the memory system approach both the access time of the fastest memory and the cost per bit of the cheapest memory. The memory hierarchy is based on the principle of locality, both temporal and spatial. The hierarchy is generally described as a pyramid, as shown in Fig.2. 1 [2.1]. The higher levels have better performance than the lower levels, but their cost per bit is higher. Ideally, the processor element accesses data at the speed of the highest level while seeing the capacity of the lowest level. Nowadays, the hierarchy is formed from cache (SRAM), DRAM, and disk storage elements. Their performance and energy consumption are listed in Table.2.1. So far, no single storage element provides low cost, high bandwidth, and low latency simultaneously. The memory hierarchy is built to hide the negative characteristics and exploit the positive characteristics of these memory technologies.

Table.2.1 Cost-performance for various memory technologies

The design and configuration of the memory hierarchy differ according to the system requirements. The following sections introduce previous work on adaptive cache design and external memory controllers.

2.2 Cache

2.2.1 An overview of Cache Memory

In the memory hierarchy, the cache plays an important role because it is the first level of the hierarchy. Its basic operation is illustrated in Fig.2. 2.

Assume the address of the processor element is 32 bits wide. The address can be divided into three parts: Offset, Index, and Tag. The Index selects a cache line, and the Tag stored in that line is then checked. If the Tag of the address equals the Tag bits recorded in the cache line and the valid bit is 1, the requested data is in the cache (a hit) and is delivered. The valid bit indicates whether an entry contains valid data. If the Tag differs or the valid bit is 0, the requested data is not in the cache (a miss) and must be fetched from a lower level of the memory hierarchy. When the data is found in the lower level, it is written into the cache and the Tag entry is updated.

Fig.2. 2 A simple cache memory. Each cache line holds Valid, Tag, and Data fields.
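The lookup described above can be sketched as a small Python model. It splits a 32-bit address into Tag, Index, and Offset, checks the valid bit and Tag of the selected line, and refills the line on a miss. The 16-byte line size, the 256-line capacity, and the names split_address, DirectMappedCache, and lower_level are assumptions made for illustration only; they are not taken from any design discussed in this thesis.

```python
from dataclasses import dataclass

OFFSET_BITS = 4   # assumed 16-byte cache line
INDEX_BITS = 8    # assumed 256 cache lines


def split_address(addr: int) -> tuple[int, int, int]:
    """Split a 32-bit address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


@dataclass
class CacheLine:
    valid: bool = False
    tag: int = 0
    data: bytes = b""


class DirectMappedCache:
    def __init__(self, lower_level):
        # lower_level is a hypothetical callable: line base address -> bytes
        self.lines = [CacheLine() for _ in range(1 << INDEX_BITS)]
        self.lower_level = lower_level

    def read(self, addr: int) -> tuple[bool, int]:
        """Return (hit, byte value) for the given address."""
        tag, index, offset = split_address(addr)
        line = self.lines[index]
        hit = line.valid and line.tag == tag
        if not hit:
            # Miss: fetch the whole line from the lower level and update the Tag.
            base = addr & ~((1 << OFFSET_BITS) - 1)
            line.data = self.lower_level(base)
            line.tag = tag
            line.valid = True
        return hit, line.data[offset]


def fake_memory(base: int) -> bytes:
    """Stand-in for the lower memory level: each byte equals its address mod 256."""
    return bytes((base + i) & 0xFF for i in range(1 << OFFSET_BITS))


cache = DirectMappedCache(fake_memory)
first = cache.read(0x1234)     # cold miss, line is filled
second = cache.read(0x1238)    # same line, so this access hits
conflict = cache.read(0x2234)  # same Index, different Tag: the line is replaced
```

Addresses 0x1234 and 0x2234 share an Index but differ in Tag, so in a direct-mapped cache they evict each other.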

The structure in the above example is called direct mapped because each memory block address maps to exactly one location in the cache. The opposite extreme is fully associative mapping, in which a memory block can be placed in any location in the cache. To find a block in a fully associative cache, all entries must be searched. The hardware cost increases significantly because one comparator per entry is needed to search them in parallel. The scheme between direct mapped and fully associative is called set associative: a block maps to one set and can be placed in any way of that set. Fig.2. 3 shows examples of the different associativity structures for a four-block cache.

Fig.2. 3 A four-block cache configured as direct mapped, two-way set associative, and fully associative.
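To make the three placements concrete, the following sketch computes which slots of a four-block cache may hold a given memory block. It assumes the usual modulo mapping of block address to set; the function name candidate_slots and the example block address 13 are hypothetical.

```python
def candidate_slots(block_addr: int, num_blocks: int, ways: int) -> list[int]:
    """Return the cache slots that may hold memory block `block_addr`.

    ways == 1 is direct mapped, ways == num_blocks is fully associative.
    """
    num_sets = num_blocks // ways
    set_index = block_addr % num_sets          # assumed modulo set mapping
    return [set_index * ways + way for way in range(ways)]


direct = candidate_slots(13, num_blocks=4, ways=1)   # one possible slot
two_way = candidate_slots(13, num_blocks=4, ways=2)  # either way of one set
fully = candidate_slots(13, num_blocks=4, ways=4)    # any slot
```

The length of the returned list equals the number of tag comparisons needed per lookup, which is why the comparator cost grows with associativity.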

2.2.2 Reconfigurable Cache Techniques and Improvements

The best cache configuration for a system depends on the application characteristics and design constraints [2.2]. Since no single cache organization can fulfill the requirements of all applications [2.3], one way to overcome this problem is to make the cache reconfigurable. Reconfigurable caches need additional mechanisms that allow the on-chip SRAM cache to be dynamically partitioned and reused by other processor elements. Reconfigurable cache organizations can be categorized by partitioning method, data consistency process, reconfiguration policy, and the cache level that is reconfigured [2.4].

The following subsections introduce the basic concepts of these cache organizations and previous work on adaptive caches.

2.2.2.1 Cache Partition methods

To resize the cache, a mechanism for partitioning the SRAM storage is needed, and it is a key challenge in designing a reconfigurable cache. Several partitioning methods are described below.

Associativity-based partitioning

Fig.2. 4 Associativity-based partitioning organization for reconfigurable caches

Associativity-based partitioning divides the reconfigurable cache into partitions at the granularity of the ways of a traditional cache [2.4]. Fig.2. 4 shows an example and compares it with a conventional set-associative cache. This approach has two advantages. First, it requires only a few changes to the existing set-associative cache organization. Second, requests addressed to different partitions are isolated from each other. The drawback is that the number and granularity of the partitions are limited by the associativity of the cache.

Albonesi [2.5] proposed the selective cache ways method for on-demand cache resource allocation. The technique disables a subset of the ways in a set-associative cache to lower energy consumption. Parthasarathy [2.4] presented reconfigurable caches for media processing applications and selected the associativity-based partitioning mechanism. Instead of simply turning off some partitions as in [2.5], that work uses the partitions for alternate processor activities to enhance performance. Zhang [2.6] proposed a highly configurable cache architecture for embedded systems, which is also based on associativity-based partitioning. The cache uses a way concatenation technique so that software can configure it as direct mapped, two-way, or four-way set associative.

Overlapped wide-tag partitioning

Another partitioning method is overlapped wide-tag partitioning [2.4]. The parts that differ from a conventional cache are indicated by the dark-shaded regions in Fig.2. 5. This method widens the tag array so that it can hold the largest tag required by any supported partition size. With this organization a partition could potentially be of any size, but in practice sizes are limited to powers of two to simplify the implementation. The main drawback is that the data in all blocks must be flushed when a resize occurs, because the mapping of addresses to cache locations changes.

Fig.2. 5 Overlapped wide-tag partitioning organization for reconfigurable caches
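The reason the tag array must be widened can be illustrated with a short calculation. The sketch below assumes a 32-bit address, 32-byte lines, a direct-mapped partition, and partition sizes from 8KB to 64KB; all of these values and the helper names are hypothetical and chosen only to show the trend. A smaller partition has fewer index bits and therefore a wider tag, so the tag array has to be sized for the smallest partition.

```python
def log2_exact(n: int) -> int:
    """log2 of a power of two; reject anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def tag_bits(addr_bits: int, partition_bytes: int, line_bytes: int, ways: int = 1) -> int:
    """Tag width = address width - index bits - offset bits."""
    num_sets = partition_bytes // (line_bytes * ways)
    return addr_bits - log2_exact(num_sets) - log2_exact(line_bytes)


partition_sizes = [8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024]
widths = {size: tag_bits(32, size, line_bytes=32) for size in partition_sizes}
tag_array_width = max(widths.values())  # the tag array must hold the widest tag
```

Because the index bits change with the partition size, a block that was valid under one size may sit in the wrong set after resizing, which is consistent with the flush requirement noted above.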

Yang [2.7] proposed an i-cache design whose size can be changed dynamically; its resizing method is similar to overlapped wide-tag partitioning. Following this work, a hybrid selective-sets-and-ways cache organization was proposed [2.8] to increase configuration flexibility. Fig.2. 6 and Fig.2. 7 show the basic structures of the selective-ways and selective-sets resizable caches, respectively. In addition, Ravi Iyer [2.9] proposed CQoS, a work on heterogeneous cache regions, in which a set partitioning technique is applied in the organization schemes.

Fig.2. 6 A selective-ways organization.

Fig.2. 7 A selective-sets organization.

Molecular-based partitioning [2.10]

In many partitioning works, the cache SRAM is divided into several individual sub-caches. We categorize these methods as molecular-based partitioning. The separate sub-caches can be dynamically reorganized according to different application requirements. Varadarajan presented Molecular Caches, which are composed of many small, reconfigurable building blocks called molecules [2.10]. The design can dynamically adjust the cache capacity, set associativity, and line size. In this design, the cache accessed by a processor is an aggregation of molecules. Molecular caches support selective enabling of molecules according to application requirements, so that dynamic power dissipation can be reduced. The physical organization of molecules is shown in Fig.2. 8, where 'M' denotes a molecule. 4-8 tiles are grouped into a tile cluster, and every cluster is associated with a tile controller named Ulmo, which processes the coherence traffic and tile misses between clusters. Fig.2. 9 shows the cache access method. Each molecule is configured with an Application Space Identifier (ASID) that uniquely identifies a running application. Before any cache operation is performed on a molecule, an ASID match is performed to check whether the molecule is eligible to perform the operation.

Fig.2. 8 Tiles - A physical organization of molecules.

Fig.2. 9 Different steps in cache access in the molecular cache

Kaseridis [2.11] proposed bank-aware dynamic cache partitioning for multicore architectures. A typical allocation in that design is shown in Fig.2. 10. According to the memory resource requirements, the L2 cache banks are divided into eight parts for eight cores.

Fig.2. 10 An example of typical CMP cache partitioning

The sub-caches can also be heterogeneous. The CQoS work presented by Ravi Iyer [2.9] uses heterogeneous caches in its platforms. In addition, Benitez [2.2] presented the Amorphous Cache (AC), a reconfigurable on-chip L2 cache organized from heterogeneous sub-caches. Fig.2. 11 shows the AC structure, whose maximum cache size is 2MB. There are six sub-caches with sizes ranging from 64KB to 1MB. The AC uses configuration registers to organize the cache into different cache sizes and associativities. It has eighteen configurations because the cache size can range from 64KB to 2MB and the associativity can be 4, 8, or 16 ways.

Fig.2. 11 Basic structure of the reconfigurable Amorphous Cache for processors with large on-chip cache memories
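The count of eighteen configurations can be checked by enumeration. The sketch below assumes that the selectable cache sizes are the powers of two from 64KB to 2MB; the text only gives the range, so this step is an assumption, although it is the one that reproduces the stated total.

```python
# Assumed selectable sizes: powers of two from 64KB up to 2MB (2048KB).
sizes_kb = [64 << i for i in range(6)]
associativities = [4, 8, 16]

# Every (size, associativity) pair is one configuration of the cache.
configurations = [(size, ways) for size in sizes_kb for ways in associativities]
```

Six sizes combined with three associativities give the eighteen configurations mentioned above.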

2.2.2.2 Data Consistency

Another problem to be solved is data consistency after the cache is resized. Reconfigurable caches need a mechanism to ensure that the data belonging to a particular processor activity resides only in the partition associated with that activity [2.4]. There are generally two approaches to data consistency: cache scrubbing and lazy transitioning. Both are briefly introduced below.

Cache scrubbing

The cache scrubbing scheme moves all valid data to the new partitions or to lower levels of memory when a reconfiguration happens. At reconfiguration time, this approach must examine every location in the cache, check its validity, and take suitable action on valid data [2.4]. Cache scrubbing incurs a large overhead because of the amount of data that is accessed, but it is acceptable when reconfiguration is infrequent.

Lazy transitioning

When reconfiguration happens frequently, a more suitable scheme is to move data lazily into its correct partition only when it is accessed. To support this, each cache line needs additional information that identifies its user, so that an access addressed to the line can be checked against it. Note that if a miss occurs in the appropriate partition, the other partitions must also be checked, because the data may still reside in one of them.

This method avoids the high overhead of moving large amounts of data at reconfiguration time, but it needs more state storage and may increase contention for the other SRAM partitions.
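A minimal behavioural sketch of lazy transitioning follows. It models each partition as a dictionary from block address to an (owner, data) pair, ignores capacity and replacement, and uses hypothetical names (LazyCache, assign, read, lower_level). It is meant only to show that reconfiguration moves no data and that a line migrates on its next access; it is not the mechanism of any cited design.

```python
class LazyCache:
    def __init__(self, num_partitions: int, lower_level):
        # Each partition maps block address -> (owner, data).
        self.partitions = [dict() for _ in range(num_partitions)]
        self.home = {}            # owner -> index of its current partition
        self.lower_level = lower_level
        self.migrations = 0
        self.fetches = 0

    def assign(self, owner: str, partition: int) -> None:
        """Reconfigure: only the mapping changes, no data is moved."""
        self.home[owner] = partition

    def read(self, owner: str, block: int):
        home = self.partitions[self.home[owner]]
        entry = home.get(block)
        if entry is not None and entry[0] == owner:
            return entry[1]                       # hit in the correct partition
        # Miss in the home partition: the line may still sit in another one.
        for part in self.partitions:
            if part is home:
                continue
            entry = part.get(block)
            if entry is not None and entry[0] == owner:
                del part[block]                   # migrate lazily on access
                home[block] = entry
                self.migrations += 1
                return entry[1]
        # True miss: fetch from the lower level into the home partition.
        self.fetches += 1
        data = self.lower_level(block)
        home[block] = (owner, data)
        return data


lazy = LazyCache(num_partitions=2, lower_level=lambda block: block * 10)
lazy.assign("pe0", 0)
before = lazy.read("pe0", 3)   # fetched into partition 0
lazy.assign("pe0", 1)          # reconfiguration, data stays where it is
after = lazy.read("pe0", 3)    # found in partition 0 and moved to partition 1
```

The extra search over the other partitions on a home miss is the source of the additional contention mentioned above.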

2.2.2.3 Reconfiguration Policy and Detection

A reconfigurable cache needs a detection mechanism and a reconfiguration policy to determine when to reconfigure. The reconfiguration strategy can be static or dynamic. With a static strategy, the cache is resized before the application executes. With a dynamic strategy, the cache organization is reconfigured while the application is running. A dynamic strategy needs a detection mechanism that monitors performance and energy dissipation at runtime to determine when to reconfigure and which organization to choose. The mechanism can be controlled by software or by hardware.

The reconfiguration policy and detection mechanism depend on the organization of the reconfigurable cache. Albonesi [2.5] used a software-visible register, called the Cache Way Select Register (CWSR), to enable or disable particular ways. The CWSR was written and read by specific pre-defined instructions. The Performance Degradation Threshold (PDT) measured the performance degradation relative to a cache with all ways enabled. According to this measurement, a suitable way organization can be selected for the cache. Kaseridis [2.11] used Mattson's stack distance algorithm and the concept of marginal utility, which originated in economic theory, as the assignment policy in bank-aware cache partitioning. Benitez [2.2] proposed a Basic Block Vector (BBV)-based tuning technique that traces the loop characteristics of the program at runtime and dynamically learns the configuration type by holding the previous CPI value.
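As an illustration of how stack distances can drive an assignment policy, the sketch below profiles an access trace with an LRU stack and derives the marginal utility of giving a core additional ways. It assumes a single LRU-managed set and uses a common formulation of marginal utility, the reduction in misses per added way; the function names and the example trace are hypothetical, and the exact policy in [2.11] may differ.

```python
def stack_distance_histogram(trace, max_ways: int):
    """Profile a trace with an LRU stack (most recently used first).

    hits[d] counts accesses found at stack distance d, so a cache with
    w ways would hit every access whose distance is below w.
    """
    stack = []
    hits = [0] * max_ways
    misses = 0
    for block in trace:
        if block in stack:
            distance = stack.index(block)
            if distance < max_ways:
                hits[distance] += 1
            else:
                misses += 1
            stack.remove(block)
        else:
            misses += 1                      # first touch is always a miss
        stack.insert(0, block)
    return hits, misses


def misses_with(hits, misses, ways: int) -> int:
    """Misses the trace would see with only `ways` ways."""
    return misses + sum(hits[ways:])


def marginal_utility(hits, misses, ways: int, extra: int) -> float:
    """Miss reduction per additional way when growing from ways to ways + extra."""
    saved = misses_with(hits, misses, ways) - misses_with(hits, misses, ways + extra)
    return saved / extra


hits, misses = stack_distance_histogram(["a", "b", "a", "c", "b", "a"], max_ways=4)
gain = marginal_utility(hits, misses, ways=1, extra=2)
```

A partitioning policy built on this profile would give the next free ways to the core with the highest marginal utility.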

The related work on reconfigurable caches is summarized in Table.2. 2.

Work    Partitioning          Data consistency
[2.2]   Molecular-based       Cache scrubbing
[2.4]   Associativity-based   Cache scrubbing      Software controlled   L1           Media processing
[2.5]   Associativity-based   Lazy transitioning
[2.6]   Associativity-based   N/A
                                                   Software              L1 I-cache   General purpose
[2.8]   Hybrid                Cache
[2.10]  Molecular-based       Cache scrubbing
[2.11]  Molecular-based       N/A
                                                   Software

Table.2. 2 Related work of adaptive caches

2.3 DRAM

2.3.1 DRAM characteristic

Dynamic random-access memory (DRAM) has been widely used to provide additional off-chip storage capacity. Compared with SRAM, a DRAM cell is "dynamic" because the capacitors that store the charge are not perfect devices: they eventually leak, so each capacitor in the DRAM must be periodically refreshed to retain the information stored in it [2.1]. However, the cost per bit of DRAM is much lower than that of SRAM. In the memory hierarchy, DRAM is the level below the on-chip SRAM (cache).

2.3.1.1 Basic DRAM architecture

A DRAM is usually composed of data memory arrays, address decoders, row buffers, a mode register, and a data buffer. Fig.2. 12 shows a simplified block diagram. In this example, four banks share the address bus and the command bus. Each bank has its own row decoder, column decoder, and sense amplifier. The mode register stores the DRAM operation mode, including the burst length (BL), the column address strobe latency (CL), and the burst type. Users can set the value of the mode register through the address bus with the proper command.

Fig.2. 12 Simplified architecture of a DRAM.

2.3.1.2 DRAM command and operation

The common DRAM commands and their operations are introduced as follows.

NO OPERATION (NOP):

The NOP command can prevent unwanted commands from being registered during idle or wait states. Operations already in progress are not affected.

ACTIVE:

This command is used to open a row in a particular bank. The row remains open for accesses until a PRECHARGE command is issued to that bank.

READ/WRITE:

The read/write command is used to initiate a read/write access to an active row. If auto precharge is selected, the row being accessed is closed at the end of the access.

PRECHARGE:

The precharge command is used to deactivate the open row in a particular bank. The bank becomes available for a subsequent row access a specified time (tRP) after the precharge command is issued.

REFRESH:

The refresh command can be used to retain data in the DRAM.

A memory access operation, whose simplified state diagram is depicted in Fig.2. 13, consists of three operations: row activation (ACTIVE), column access (read/write), and precharge.

Fig.2. 13 Bank state diagram.

The active command opens a particular row in one of the banks and copies the row data into the row buffer. This operation takes a latency period called tRCD. After the tRCD delay, a column access command (read/write) can be issued to access a sequence of data or a single data item, according to the burst length and burst type set in the mode register. During the tRCD time, no other commands can be issued to the bank. However, commands to other banks are permitted because the banks can operate in parallel. For a read operation, the valid data-out from the starting column address is available after the CAS latency following the read command, as shown in Fig.2. 14. For a write command in DDR3 SDRAM, the write data is sent to the DRAM after the write latency has elapsed. The timing diagram is shown in Fig.2. 15. Finally, a precharge command must be issued before a different row in the same bank can be opened.

Fig.2. 14 DDR3 Read command [2.12].

Fig.2. 15 DDR3 Write command [2.12]
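The command sequence above implies that the latency of a read depends on the state of the row buffer. The sketch below models a single bank and returns the commands and the number of clock cycles until the first data for a read. The timing values (5 cycles each for tRCD, CL, and tRP) are placeholders rather than figures from a datasheet, and the model ignores refresh, burst transfer time, and other constraints such as tRAS; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timing:
    tRCD: int = 5   # ACTIVE to READ/WRITE delay, placeholder value in cycles
    CL: int = 5     # CAS latency, placeholder value in cycles
    tRP: int = 5    # PRECHARGE period, placeholder value in cycles


class Bank:
    def __init__(self, timing: Timing):
        self.timing = timing
        self.open_row: Optional[int] = None   # None means the bank is idle

    def read(self, row: int) -> tuple[list[str], int]:
        """Return (commands issued, cycles until the first data word)."""
        t = self.timing
        commands: list[str] = []
        cycles = 0
        if self.open_row is not None and self.open_row != row:
            # Row conflict: close the open row before activating another one.
            commands.append("PRECHARGE")
            cycles += t.tRP
            self.open_row = None
        if self.open_row is None:
            # Row closed: activate it and wait tRCD before the column access.
            commands.append("ACTIVE")
            cycles += t.tRCD
            self.open_row = row
        commands.append("READ")
        cycles += t.CL
        return commands, cycles


bank = Bank(Timing())
closed_access = bank.read(row=3)     # idle bank: ACTIVE then READ
row_hit = bank.read(row=3)           # same row is open: READ only
row_conflict = bank.read(row=7)      # different row: PRECHARGE, ACTIVE, READ
```

Under these placeholder timings a row hit costs one third of a row conflict, which is why keeping accesses within an open row matters for a memory controller.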

2.3.2 DRAM controller techniques and Improvements

Depending on the application or system, memory controllers can be categorized into two classes: particular-purpose and general-purpose memory controllers. A particular-purpose memory controller serves one specific kind of application in order to reduce its memory access latency. In many multimedia applications, advanced video processing needs a large amount of data storage. To support a real-time video environment, the system needs external memory to store image frame data or motion information. However, memory access is much slower than the execution speed of the processor unit. Many researchers have shown that a memory management method designed around the regular memory access behavior of video processing can significantly improve overall system performance.

Based on specific applications, several approaches have been proposed to increase the efficiency of memory access for video coding.

Kim's memory interface architecture [2.13] reorganizes the data arrangement in synchronous DRAM to increase the row-hit rate. Park proposed a memory node control approach [2.14] for an HDTV video decoder. It uses history-based prediction to predict whether the next command is a row hit or a row miss. If it predicts the next command is
