4.2.3 Simulation Results of the Adaptive Cache

This section presents the simulation results of the adaptive cache. Section 4.2.3.1 first introduces the methods used to estimate the access latency and energy of the memories. Based on these methods, the simulations of static and dynamic bank assignment are described in Section 4.2.3.2.

4.2.3.1 Latency & Energy Estimation Method

The following subsections introduce the methods used to estimate the execution latency and energy consumption of the on-chip cache and the off-chip DRAM.

4.2.3.1.1 Cache Latency Estimation

For verification and simulation, a cycle-driven model is developed in SystemC. With the SystemC models of the memory management units, the cache access latency is accounted for in the simulation.

4.2.3.1.2 Cache Energy Estimation

To estimate the cache energy consumption at the system level, the CACTI 5.3 model [4.7], provided by HP Labs, is applied to characterize the energy consumption of the memory elements in the d-MMU and c-MMU. CACTI is a model that lets users estimate cache and memory access time, cycle time, area, leakage, and dynamic power. From the selected cache parameters, it generates the corresponding dynamic energy and standby leakage power, from which the total energy consumption can be calculated.
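The way dynamic energy and leakage power combine into a total can be sketched as follows. This is a minimal illustration, not the thesis' actual tool flow: the function name, its parameters, and the numeric values in the example are all hypothetical placeholders, and real per-access energies and leakage power would come from the CACTI output for the chosen cache parameters.

```python
def cache_energy_joules(reads, writes, read_energy_nj, write_energy_nj,
                        leakage_power_mw, cycles, clock_hz):
    """Total cache energy = dynamic energy of all accesses + leakage over run time.

    All parameter names are illustrative assumptions; per-access energies (nJ)
    and leakage power (mW) are assumed to be taken from a CACTI report.
    """
    dynamic_j = (reads * read_energy_nj + writes * write_energy_nj) * 1e-9
    runtime_s = cycles / clock_hz
    leakage_j = leakage_power_mw * 1e-3 * runtime_s
    return dynamic_j + leakage_j


# Placeholder numbers only, chosen to make the arithmetic easy to follow.
example_total = cache_energy_joules(
    reads=1000, writes=500,
    read_energy_nj=0.1, write_energy_nj=0.12,
    leakage_power_mw=10.0,
    cycles=333_000, clock_hz=333e6,
)
```

In this form the access counts and cycle count would be supplied by the cycle-driven simulation, while the energy and power coefficients stay fixed for a given cache configuration.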

4.2.3.1.3 DRAM Latency Estimation

The memory access latency can be estimated from the Centralized MMU block size and the selected DRAM configuration. When a cache miss occurs, the Centralized MMU sends memory requests to DRAM to read the required block and write back the replaced block. Assuming a block size of 64 bytes in the Centralized MMU configuration, a block access generates four DRAM access commands with a burst length of eight. Fig. 4.26 shows the DRAM read latency for one block in different situations. When the referenced bank is in the row-closed state, an activate command must be issued to open the target row. After tRCD cycles, the read command can be issued, and the read data is output successively after tCL cycles. Because the read burst length is 8, the four bursts take 16 cycles in total to output the data. Note that the data in a block is generally located in the same row, so a row conflict does not occur within a block access. The timing diagram and cycle estimation are shown in Fig. 4.26(a). When the present row address equals the previously activated row in the referenced bank, the activate command can be omitted; the corresponding cycle estimation is shown in Fig. 4.26(b). Conversely, a row conflict occurs when the present row address differs from the previously activated row in the referenced bank. The pre-charge and activate commands must then be issued to change the row, which adds tRP cycles to the estimate. The timing diagram with a row conflict is shown in Fig. 4.26(c).

(a) Row closed: ACT, READ0 to READ3; tRCD + tCL + 16 = 26 cycles
(b) Row hit: READ0 to READ3; tCL + 16 = 21 cycles
(c) Row conflict: PRE, ACT, READ0 to READ3; tRP + tRCD + tCL + 16 = 31 cycles

Fig. 4.26 DRAM latency estimation for different situations
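The three cycle counts in the figure can be reproduced with a short calculation. The sketch below is illustrative only: the function and constant names are hypothetical, and the values tRCD = tCL = tRP = 5 cycles are not stated directly in the text but are inferred from the figure's own totals (21 - 16, 26 - 21, and 31 - 26). The 16-bit device width is taken from the x16 DRAM configuration used in the simulation.

```python
# Timing values in clock cycles, inferred from the totals in the figure.
T_RCD = 5  # activate-to-read delay (26 - 21)
T_CL = 5   # read command to first data (21 - 16)
T_RP = 5   # pre-charge period (31 - 26)

BLOCK_BYTES = 64        # Centralized MMU block size
DEVICE_WIDTH_BITS = 16  # x16 DRAM device
BURST_LENGTH = 8        # beats per read command


def read_commands_per_block():
    """Number of read bursts needed to fetch one cache block."""
    bytes_per_burst = (DEVICE_WIDTH_BITS // 8) * BURST_LENGTH
    return BLOCK_BYTES // bytes_per_burst


def data_cycles():
    """Clock cycles spent outputting data; DDR moves two beats per clock."""
    return read_commands_per_block() * BURST_LENGTH // 2


def block_read_latency(row_state):
    """Cycles to read one block for row_state in {'closed', 'hit', 'conflict'}."""
    latency = T_CL + data_cycles()
    if row_state == "hit":
        return latency
    if row_state == "closed":
        return T_RCD + latency
    if row_state == "conflict":
        return T_RP + T_RCD + latency
    raise ValueError(f"unknown row state: {row_state}")
```

With these inputs, one block needs four read commands and 16 data cycles, and the three row states give 26, 21, and 31 cycles, matching cases (a), (b), and (c).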

4.2.3.1.4 DRAM Energy Estimation

To measure the DRAM energy consumption, the System Power Calculators provided by Micron Technology Inc. [4.9] are used. These models estimate the power requirement of SDRAM devices in a system environment. The tools provide a friendly interface for estimating the memory power needed when making important system architecture and design decisions. With an accurate estimate of power consumption, the system designer can quickly work through complex system trade-offs to optimize system performance [4.9].

The DRAM power consumption is calculated automatically from the selected DRAM and system configurations. The configuration used in our simulation environment is summarized in Table 4.6. An example of the DDR3 configuration interface in the System Power Calculator is shown in Fig. 4.27, and the system configuration interface is shown in Fig. 4.28. To simplify the simulation, only one DRAM device is used to measure the power consumed by SVC memory accesses, so the number of ranks is set to 1. In the system configuration, the percentage of time that all banks are in the pre-charged state is set to zero because an open-page policy is used. In addition, the DRAM page hit rate and the percentage of cycles spent transferring data to and from the DRAM, which are marked in Fig. 4.28, must be measured for the power calculation.

After these configurations are set, the calculator generates the DRAM power. Fig. 4.29 summarizes the power result, including background power, activate power, and read/write/termination power. These results give the estimated DRAM power in the system. Detailed documentation and the System Power Calculator tools can be downloaded from the website in [4.9].
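The measured inputs mentioned above (page hit rate and data-cycle percentages) could be derived from simulation counters along the following lines. This is a sketch under assumptions: the counter names, the dictionary keys, and the example numbers are hypothetical and are not the calculator's actual field names, and splitting the data cycles into read and write percentages is an assumption about how the values are entered.

```python
from dataclasses import dataclass


@dataclass
class DramCounters:
    """Hypothetical counters collected by the cycle-driven simulation."""
    row_hits: int           # accesses that found the target row already open
    row_misses: int         # accesses that needed an activate (closed or conflict)
    read_data_cycles: int   # cycles in which read data was on the bus
    write_data_cycles: int  # cycles in which write data was on the bus
    total_cycles: int       # total simulated DRAM clock cycles


def calculator_inputs(c):
    """Convert raw counters into percentages for manual entry in the tool."""
    accesses = c.row_hits + c.row_misses
    hit_rate = 100.0 * c.row_hits / accesses if accesses else 0.0
    return {
        "page_hit_rate_pct": hit_rate,
        "read_cycles_pct": 100.0 * c.read_data_cycles / c.total_cycles,
        "write_cycles_pct": 100.0 * c.write_data_cycles / c.total_cycles,
    }


# Placeholder counter values for illustration only.
example_inputs = calculator_inputs(
    DramCounters(row_hits=750, row_misses=250,
                 read_data_cycles=2000, write_data_cycles=1000,
                 total_cycles=10000)
)
```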

DRAM configuration
  DRAM Model          Micron 1Gb DDR3 SDRAM (MT41J64M16)
  Configuration       64Meg x 16
  Speed Grade         -15E

System configuration
  VDD                 1.5V
  Clock frequency     333MHz
  Burst length        Fixed to 8
  Number of Rank      1
  DRAM Page Policy    Open page policy (the percentage of time that all banks on the DRAM are in a precharged state is set to 0)

Table 4.6 Summary of system and DRAM configuration

Fig. 4.27 DDR3 configuration interface of the System Power Calculator

Fig. 4.28 System configuration interface of the System Power Calculator

Fig. 4.29 Summary of the power measurement result in the System Power Calculator

4.2.3.2 Simulation of Adaptive Cache

To serve the different memory demands of the PEs in the memory-centric on-chip data communication platform, the c-MMU supports reconfigurable bank assignment. To simulate the behavior of stream applications in a heterogeneous system, a task-level pipeline organization is used, as illustrated in Fig. 4.30. The stream application is assumed to be separable into four tasks that are mapped to four nodes in the platform, with each node forming one pipeline stage of the application. According to the memory behavior of each node, the c-MMU allocates a different number of SRAM banks to each node.

[Figure: Task 0, Task 1, Task 2, Task 3 mapped to node 0, node 1, node 2, node 3; c-MMU (with adaptive cache control); DRAM]

Fig. 4.30 Organization of simulation

For the adaptive cache simulation, tasks with random memory accesses are run on each node. A task consists of 100 random memory accesses within a particular address range. To model pipeline behavior, the task in pipeline stage N can be launched once the task in stage N-1 is done. The simulation assumes that the system can profile the memory requirement of each node in different time intervals. By updating the BAT in the proposed c-MMU with the profiled information, adaptive partitioning of the memory resources can be achieved. A suitable adaptive bank assignment can improve execution time and energy consumption compared to the fixed bank assignment, in which every node statically owns an equal number of banks.
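The workload and the bank re-assignment step can be sketched as follows. The text does not specify how profiled requirements are converted into a bank assignment, so the proportional policy here (at least one bank per node, remaining banks split by requirement) is purely an assumption for illustration, as are the function names and the example requirement values.

```python
import random

TOTAL_BANKS = 16         # banks in the c-MMU
ACCESSES_PER_TASK = 100  # random accesses per task


def make_task(base_addr, range_bytes, rng):
    """One task: random addresses within [base_addr, base_addr + range_bytes)."""
    return [base_addr + rng.randrange(range_bytes)
            for _ in range(ACCESSES_PER_TASK)]


def assign_banks(requirements, total_banks=TOTAL_BANKS):
    """Assumed policy: one bank per node, the rest proportional to requirement.

    Leftover banks after rounding down go to the nodes with the largest
    fractional share. Returns the number of banks for each node.
    """
    n = len(requirements)
    if sum(requirements) == 0:
        requirements = [1] * n  # no profile information: split evenly
    total = sum(requirements)
    spare = total_banks - n
    shares = [r * spare / total for r in requirements]
    banks = [1 + int(s) for s in shares]
    leftover = total_banks - sum(banks)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        banks[i] += 1
    return banks


# Hypothetical profiled requirements (in KB) for nodes 0..3 in one interval.
adaptive = assign_banks([256, 128, 64, 64])
fixed = assign_banks([128, 128, 128, 128])
task = make_task(0x1000, 4096, random.Random(0))
```

Under this assumed policy, equal requirements reproduce the fixed assignment of four banks per node, while skewed requirements move banks toward the more demanding nodes; the resulting list would be what the system writes into the BAT at the start of each time interval.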

Table 4.7 lists the memory configurations used in the simulation. One pattern of assumed memory requirements is simulated, and Table 4.8 lists the information for this pattern. Based on the assumed memory requirements, the system updates the BAT in the c-MMU to assign banks. Because the memory requirements are assumed to change at runtime, Table 4.8 lists the assumptions for three time intervals. In the simulation, 500 tasks are finished in each time interval. By profiling the memory requirements and re-allocating the banks, the memory resources in the c-MMU can be utilized effectively.

After 1500 tasks (three time intervals) are finished, the adaptive assignment reduces execution cycles by 40.41% and memory energy by 48.54% compared to the fixed bank assignment method.

L1 cache (d-MMU) configuration
  Cache Size            4KB
  Number of banks       2
  Associativity         4-way
  Block size            32-byte
  Replacement policy    LRU
  Write policy          Write back

L2 cache (c-MMU) configuration
  Cache Size            512KB
  Number of banks       16
  Associativity         N-way, 1<=N<=16 (depends on bank assignment)
  Block size            64-byte
  Replacement policy    LRU
  Write policy          Write back

External Memory configuration
  Device                DDR3 SDRAM
  Channel/Rank/Bank     1/1/8
  Size                  128MB
  Number of banks       8
  Burst length          Fixed to 8
  DRAM Page Policy      Open page policy

Table 4.7 List of simulation information

Time    T0    T1    T2

Table 4.8 Memory requirement assumption and corresponding bank assignment for c-MMU

Fig. 4.31 Total execution cycles

                 Fixed    Adaptive
Off-chip DRAM    3.51     1.76
On-chip Cache    0.25     0.17