調適提前讀取技術與重疊延遲排程之三維寬頻記憶體

(1)

國

立

交

通

大

學

資訊科學與工程研究所

碩

士

論

文

調適提前讀取技術與重疊延遲排程之

三維寬頻記憶體

Adaptive Prefetching Techniques and Latency Overlapping

Scheduling for 3D Wide I/O Memory

研究生：郭泰均

指導教授：陳添福教授

(2)

調適提前讀取技術與重疊延遲排程之三維寬頻記憶體

Adaptive Prefetching Techniques and Latency Overlapping Scheduling for 3D

Wide I/O Memory

研究生：郭泰均 Student：Tai-Chun Guo

指導教授：陳添福博士 Advisor：Dr. Tien-Fu Chen

國立交通大學

資訊科學與工程研究所

碩士論文

A Thesis

Submitted to Institute of Computer Science and Engineering College of Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master

in

Computer Science

June 2012

Hsinchu, Taiwan, Republic of China

(3)

i

調適提前讀取技術與重疊延遲排程之三

維寬頻記憶體

研究生：郭泰均

指導教授：陳添福博士

國立交通大學資訊學院資訊學程碩士班

摘

要

隨著記憶體與中央處理器存取速度的差距，記憶體已經成為系統效能的瓶頸，提升記憶體存取速度將有助於改善系統效能，由於近來 TSV 技術的成熟，晶片將可以用堆疊的方式來減少存取延遲時間。然而當系統晶片整合朝向三維發展，設計將會更複雜且效能更難被評估。本篇研究我們實作一個三維記憶體模擬平台並且支援 JEDEC 寬頻記憶體傳輸介面來評估系統效能。同時模擬器支援多執行緒平行處理來加快模擬時間。藉由分析寬頻記憶體的特性，我們提出了兩個機制來提升三維架構下的記憶體效能，調適提前讀取技術藉由分析記憶體區塊的存取密集程度及狀態並且參考指令列隊的數量來進行提前讀取，並且提高堆疊層間的平行度。重疊延遲排程則是利用 TSV 傳輸延遲來提前執行充電指令，達到重疊延遲時間。

(4)

ii

Adaptive Prefetching Techniques and Latency Overlapping

Scheduling for 3D Wide I/O Memory

Student: Tai-Chun Guo

Advisor: Dr. Tien-Fu Chen

Degree Program of Computer Science

National Chiao Tung University

ABSTRACT

Due to the gap between memory and CPU speed, memory has become a bottleneck in

computing systems. Improving memory access latency will improve system performance.

As TSV technology matures, chips stacked in different stratums can reduce access latency.

However, as SoC development moves towards 3D, it becomes increasingly difficult to

evaluate complex systems designs.

In this thesis an ESL platform is implemented which can support JEDEC wide I/O

interface to evaluate memory performance. The simulator supports multi-threaded modeling

and speedups the simulation time. After analyzing address mapping methods and properties

of wide I/O, this thesis proposes two mechanisms to improve the performance of 3D

architecture. Adaptive-prefetching will analyze memory intensive blocks and reference

command queue status to prefetch data and improve RLP. Latency overlapping scheduling

(5)

iii

誌

謝

感謝我的指導教授，陳添福教授，給予我機會和指導與幫助，從教授身上學到許多寶貴的經驗與知識，和研究應抱持的態度。感謝口試委員：莊俊雄組長、黃婷婷教授、賴柏承教授，在口試時，給予相當多的指正與叮嚀，讓我瞭解更多東西及知道我 思考不周的方向。 感謝楊耿豪學長在我撰寫論文期間，給予我相當多的建議，感謝許雅淳學姊、蘇資翔學長，在我遇到問題的時候，都能給予我適當的幫助。感謝所有實驗室同仁，不論在計畫、課業或是論文上，都能夠互相幫助、互相鼓勵，另外，感謝在口試當天學弟妹們的幫忙，使得口試得以進行得相當順利、流暢。最後，感謝所有幫助過我的學長姐、同學、學弟妹們，有你們的幫忙，讓我能夠有豐富的收穫，才有今日的我，感恩。最後，感謝家人的支持，由於你們的支持，我才能繼續下一個旅程，謝謝你們。

郭泰均

2011/08/16 於國立交通大學 (Lab. 446A)

(6)

iv

List of Tables

Table 1 Comparison of Scheduling ... 12

Table 2 FDP Control Table ... 13

Table 3 Comparison of Prefetching ... 15

Table 4 System Configuration ... 33

Table 5 Overview of Benchmark ... 34

Table 6 Memory Configuration ... 35

Table 7 LPDDR2 Power Parameter ... 36

Table 8 Precharge Predict Overhead ... 43

(9)

vii

List of Figures

Figure 1 Memory with 3D stacking ... 2

Figure 2 Row Buffer ... 6

Figure 3 Memory System Overview ... 8

Figure 4 TSV Technology ... 9

Figure 5 FR-FCFS Command Scheduling ... 11

Figure 6 Program Execution Flow ... 16

Figure 7 Pthread Accelerating Method ... 21

Figure 8 Simulation Speedup Times ... 22

Figure 9 Different Address Mapping Method ... 25

Figure 10 Prefetching Table ... 28

Figure 11 Prefetching Flowchart ... 29

Figure 12 Timing Diagram for Hide Latency Scheduling ... 30

Figure 13 Bank Precharge Prediction Table ... 31

Figure 14 Bank Precharge Prediction Scheduling Flow ... 32

Figure 15 Precharge Prediction Table Size ... 37

Figure 16 Latency Improvement by different Scheduling ... 38

Figure 17 Power Improvement by different Scheduling ... 38

Figure 18 Prefetching Caching Size Tradeoff ... 39

Figure 19 Improvement Relative to No Threshold Prefetching ... 40

(10)

viii

Figure 21 Latency Improvements by Adaptive-Prefetching ... 41

(11)

1

I. Introduction

1.1 Motivation and Introduction

Memory System is one of the most important in computer architecture, but the speed of

memory access is improving much slower than CPU. The gap of memory and CPU speed is

called “Memory Wall” and it has been a system bottleneck for two decades. One of the reasons is due to the different between memory and logic IC fabrication process. DRAM is

made by high-density NMOS transistors which emphasize on capacity or leakage power. On

the other hand, processors use CMOS transistors and optimize for performance. The

difference means that logic layer and DRAM layer can’t coexist because IC fabrication

process constraints.

Recently, memory bandwidth demands and chip size are still increasing since multicore

era. The new technology called “TSV” may solve this problem. With the advancement of

technology, 3D SoC integration can stack multiple stratums of chips and breaks area and pin

count constraints. It also brings other advantage such as lower latency and more power

efficiency. Although 3D stacked memory is more powerful than traditional designs, it’s hard

to evaluate the complex SoC design. In order to analyze performance of this new

technology, building an accuracy model to simulate and provide some information for the

3D stacking technology in the early stage is necessary. Using a platform which is built by

(12)

2

design and this flow is generally called ESL[1][2].

Logic layer is designed to be at the bottom stack of the stacked chip architecture for

mobile wide-I/O DRAM, as shown by Kim[3]. Traditional memory is constrained by the

specifics of processors and is hard to improve performance. Memory controller design can

be more complex and separates from processor side to memory logic layer. One of the

advantage is the memory controller can be optimized by its own memory manufacturers.

Figure 1 show the design of 3D stacked memory. There are some issues that need to be

addressed in order to improve memory performance in wide I/O.

Figure 1 Memory with 3D stacking

First, balance resource’s utilization by analyzing memory address mapping. Address

mapping affects the banks or ranks usage. It is hard to decide optimized mapping method

for different memory systems because diverse memory hardware constrains and distinct

properties. This work explores wide I/O property with different mapping method by traces

analysis.

(13)

3

difficult for prefetcher to get memory state information traditionally because it is designed

close to CPU. With 3D SoC integration, memory controllers can combine more complex

logic design. This thesis proposes adaptive prefetching technique by analyzing memory

block intensively and the possibility of continuous adjacent addresses. In order to reduce

useless prefetch requests, this work further use a set of counters to observe the command

queue state and improve memory rank level parallelism.

Third, open-page row buffer policy is widely used because most programs have spatial

locality when access memory. DRAM controller will issue precharge command to close row

buffers until other command use other rows in the same bank. It may waste power if the row

buffer is not used more efficiently. Moreover, TSV data bus is shared by each rank and

transfers data with burst length mode. The following data commands have to wait for the

transmission complete. This thesis try to pre-issue precharge command when TSV bus is

busy and overlapping the latency.

Finally, discuss the extra hardware cost in our experiment.

1.2 Contribution

1 Provide an ESL platform that can model Wide I/O interface. It can help to evaluate 3D

IC integration development in the early-stage.

2 Using pthread to speed up simulation time.

3 Analyze the address mapping method and property of wide I/O interface.

4 Provide two mechanism: Adaptive prefetching techniques and latency overlapping

(14)

4

1.3 Report Organization

The remainder of this thesis’ structure is as follow. Section II introduces the background of DRAM and the policies to improve system performance like prefetching or

predict precharge. Section III describes the framework of our 3D DRAM simulator. Section

IV describes the scheduler and policy to hide latency policy in order to improve the DRAM

access latency. Section V introduces our experiment environment. The result and cost are

also taken into consideration. Finally this thesis compares our method with others and

(15)

5

II. Background and Related Work

TSV technology is the trend of future design. This chapter will briefly describe some

basic concepts of DRAM property and 3D wide I/O interface backgrounds. Next, this work

introduces previous works that utilize prefetching and scheduling method to improve

DRAM performance. Finally, we will present a table of analysis advantage and drawback

analysis on those control method in 3D IC era.

2.1.1 Row Buffer

Row buffer is like a cache of bank arrays. It is a very important component in the

DRAM system because every access will go through this buffer and transfer by amplifier to

enlarge the signal. Figure 2 shows a typical row buffer design. Upon memory access, the

entire row of DRAM bank is amplified by a sense amplifier before loading into row buffer

and in the process, the row data in DRAM is destroyed. This action is called “ACTIVATE”.

After the row buffer is activated, the following row access is related to the row buffer.

Because activate is a destructive access, data must be restored into DRAM before another

row is used. The restore action is called “PRECHARGE”. The maximum number of

activated row buffers is limit per rank. For example, DDR2 and DDR3 use FAW(Four

activated windows), but Wide I/O uses TAW(Two activated windows). Activated command

(16)

6

relationship to each operation. “tRCD” is the minimum time interval that can issue memory

access command after the activated action. “tCL” is the minimum timing interval between

memory access command and data put into the data bus. “tRP” is the minimum timing

interval between recharge. “tRCD”, “tRP”, and “tCL” are the most frequently discussed

timing parameter in memory systems.

Figure 2 Row Buffer

2.1.2 Row Buffer Page Policy

While the memory is accessing data, there are three common timing parameters:

precharge, activated, and access column. If the row being accessed is currently opened, it

can directly access data without precharging or activating the row. These three primary row

buffer page policies are open page policy, closed page policy and timing policy.

 Open Page Row Buffer Policy

(17)

7

for data with spatially locality because the row buffers do not have to be activated

twice in a period. If the access commands are mapped to the same opening row, it can

directly access the DRAM data without other commands. And it will issue

PRECHARGE command until a new command accesses a different row. The open

page policy is widely used in general-purpose computer.

 Close Page Row Buffer Policy

Close page policy will close the row buffer as soon as the command access is

completed. It will take advantage of random memory access because it may save

precharge time. If the rate of row buffer hit is low, close page policy may be better than

open page policy.

 Timeout Policy

Timeout policy will close the opening row after a time interval. This policy will

provide tradeoff between open and close page policy.

2.1.3 Memory System

Recently the DRAM system has one or more channel controllers design. Each

controller controls a separate portion of the memory system. Because of the command and

data bus are also duplicated, each controller can issue command or access data concurrently

without any dependency.

The basic DRAM system consists of one or more ranks. Each rank has many Banks and

(18)

8

system must delay command and pay the rank to rank switch time penalty.

Bank is the basic component of the DRAM system just like a two dimension data array

which is composed by many bit cells. Because each bank in the same rank can work

concurrently, it can hide the latency if multiple banks are activated and accessed data in

parallel. Each bank has a row buffer and the data access must go through the row buffer.

Bank 0 Row Buffer

Bank 1 Bank 2 Bank 3 Row Buffer Row Buffer Row Buffer Rank 3 Bank 0 Row Buffer

Bank 1 Bank 2 Bank 3 Row Buffer Row Buffer Row Buffer Rank 0 Memory Controller

Figure 3 Memory System Overview

2.1.4 TSV Technology

TSV is abbreviated from Through-Silicon-Via. The new technology lay chips stack by

stack 3 dimensionally. The delay of TSV is also shorter than tradition horizontal bus.

Memory can be stacked as a cube then add a new dimension call layer. The delay is directly

(19)

9

Wide I/O is based on the TSV technology to control data transfer of the stacked

memory. The feature of this device is it can stack up to four layers of memory. There are

four channel controllers in the Wide I/O interface, and each controller controls section of

the memory stack independently. The data bus is 128 bits wide. The total memory size can

be up to in 32 Gb in density. The voltage is only 1.2V and the frequency is 200MHz. Each

layer is an independent system, so the timing constraint is different from transition design.

For example, DRAM system can issue access commands to different rank in succession in a

3D memory system, but traditionally the timing delay constraint is exists which is call Rank

to Rank Switch time. Previous researches [4][5][6] analysis of TSV shows that 3D

integration technology will impact not only memory stacks, CPU caches and cores may also

need to be re-designed. Wide I/O 128 bits Rank_2 Rank_1 Rank_0 Rank_3 Channel_0 Channel_1 Channel_2 Channel_3 Channel Controller Channel Controller Figure 4 TSV Technology

(20)

10

2.2 Memory Controller Scheduling

Scheduling policy plays an important role in the Memory system. Because of the

destructive access control in DRAM, the access ordering will impact numbers of precharge

and activate new row into row buffer. Unsuitable access ordering will cause power

consumption and longer access latency. There are many researches about reorder the

command requests to improve the system performance. Different polices optimizes the

scheduler to adapt to a particular property. A good scheduling will improve the DRAM

bandwidth and delay. In this section, this thesis will introduce some state-of-the-art

scheduling policy about DRAM.

2.2.1 FR-FCFS Schedule

Because DRAM access pays less latency penalty when accessing already opened rows,

read latency can be significantly reduce with more page hits. Rixner proposed a scheduling

policy which is called “First Ready-First Come First Serve”[7] and it is the common policy

in today’s memory system. This scheduler will reduce latency as the following timing

diagram. FR-FCFS issues commands to already opened row buffer over other requests. It is

just like a greedy algorithm scheduling policy for improving row buffer hit ratio. If there are

no opening requests that can be issued now, the scheduler will prioritizes and issue oldest

(21)

11

Figure 5 FR-FCFS Command Scheduling

2.2.2 Fairness Scheduling

Because FR-FCFS scheduling policy is a greedy algorithm, it may cause starvation

when some requests are continuously served in the period. A memory intensive program

will send a lot of memory requests to DRAM and be issued quickly because the possibility

of those rows being opened is higher than other request. Other requests will delay for a long

time and programs will be stall because the data is not return from memory. Some

researches take requests waiting time into consideration and adjust the controller’s request

priority periodically. For example, the scheduler can give basic priority number for requests

and increase it after a time interval. This action gives higher priority to the older requests,

thus the read request delay times are more balance. The priority aging[8] concept will

reduce the unfairness situation for solving starvation and it is important since it affects the

CPU waiting time. There is a tradeoff between performance and fairness because it breaks

the continuous row buffer access times for fairness scheduling. Memory timing constraints

should also be taken into considers. Virtual writing queue[9] reduces memory read to write

delay penalty. By handling write command separate from read, this mechanism helps to

(22)

12

Thread status can also issue to the memory scheduling, because of shared cache

replacement and data bus, many resource may cause contentions. ATLAS[10] is a memory

scheduling algorithm that improves system throughput by prioritizing threads that have

smaller attained memory service. Those threads may more likely be memory non-intensive

and improve performance significantly by serving those threads’ request soon. PAR-BS[11]

focuses on thread fairness and prevent short term and long term starvation by taking

registers info per request and thread ranking. Although there are many method to improve

memory performance, most of them have to bypass some information from CPU to memory

controller and the design must also be constrained.

2.2.3 Scheduling Summary

Table 1 Comparison of Scheduling

Many scheduling are based on FR-FCFS policy to improve performance, and pass

information from CPU by adding extra cache tag. This method will increase bus

(23)

13

issue precharge command early and overlap the bus transmission time.

2.3 Prefetch Policy

Prefetching mechanism can predict the next address and fetch data beforehand. Stream

prefetcher can keep track of multiple access streams. “Prefetch Distance” and “Prefetch

Degree” determine the aggressiveness of the prefetcher. Aggressive prefetching mechanism

can improves performance, but miss fetch in some benchmark decreases system

performance.

In order to reduce the miss predict ratio some papers proposes to analyze the feedback

information and adjust the prefetcher dynamically. Feedback Directed Prefetching[12]

provides three factors to control prefetcher’s behavior: “Prefetch Accuracy”, “Prefetch

Lateness”, and “Cache Pollution” as show in Table 3.

Prefetch Accuracy Prefetch Lateness Cache Pollution Action

No Polluting Increase Polluting Increase No Polluting No Polluting Decrease No Polluting Increase Polluting Decrease No Polluting No Polluting Decrease No Polluting Decrease Polluting Decrease No Polluting No Polluting Decrease Medium Low Late No Late Late No Late Late No Late High

Table 2 FDP Control Table

As soon as the prefetching data returns, it will update the information and determine

whether to use prefetcher or not. These prefetchers are often incorporated in cores and

(24)

14

of the prefetch requests, the Last Level Cache has to add the entry to identify the demand

requests and prefetching requests. It is an overhead in overall architecture and does not fit

the DRAM property. Coordinated control [13] solves the problem about inter-core prefetch

requests conflict in cache. Prefetcher causes inter-core interference and may diminish

prefetching’s potential performance. In order to reduce conflict, it adds an extra center

controller to manage which prefetch requests can be sent into memory system. Lee et al.[14]

combines the prefetcher and memory architecture to improve bank level parallelism. Bank

is the basic component of the memory system, each bank can work independently. The

concept of bank level parallelism is to overlap delay by access bank concurrently.

BLP-Aware Prefetch Issue will keep track of which bank the prefetching requests will be

issued to and issue them out of order to improve the bank utilization. Furthermore,

BLP-Preserving Multi-core Request Issue Policy also considers the prefetching requests

from multi-core. It controls the order of prefetching requests and issues those from the same

core first. This mechanism minimizes the destructive interference in the BLP of each

program.

The general priority schedulers that deals with prefetch requests can be put into two

categories, one type of prefetching request schedulers[15–17] gives less priority to

prefetching requests than normal request. Since the prefetcher is assigned lower priority, the

normal request will be issued faster than the predicted prefetching requests. It will drop the

prefetching request if the prefetching requests waiting time is too long, as it is more likely

te be a useless demand. The other policy[18] treats prefetching requests as a demand request.

So it will schedule equally and fairly performance if the prefetcher prediction hit rate is

(25)

15

2.3.1 Comparison of Prefetch Mechanism

Table 3 Comparison of Prefetching

Many prefetchers access data and dynamic adaptive current state. There are some extra

components to record precise and pass into memory controller. In this work, we use a table

record memory intensive block and prefetch data will improve latency with low hardware

(26)

16

III. 3DRAMSim Framework

3.1 Overview

This section describes the feature of our 3DRAMSim simulator. 3DRAMsim is a

DRAM cycle accurate simulator which is based on DRAMSim2[19]. Our simulator can

provide 3D memory environment to analyze design in an early-stage. In our simulator, we

also can simulate traditional DRAM like SDRAM or DDR memory from JEDEC protocols.

Our simulator can easily model different type of memory by modify the system

configuration files and timing parameters. All DRAM properties are parameterized and

described in the memory configure file. With the development of TSV technology, system

can also control different layer independently and share the same data TSV bus. Our

simulator can also model multi-channel architecture and concurrently operate of DRAM.

(27)

17

Figure 6 illustrates the 3DRAMSim flow diagram. The proposed simulator includes

three parts: trace splitter, memory controller, and memory storage. The trace splitting stage

includes overall system timing control, and address mapping analysis tool. In order to

simulate the multi-channel environment, trace must be pre-process, split into sub-traces and

issued by each channel controller. Trace splitting has to know the address mapping method

and separate requests mapping into their sub-trace. After the trace splitter, each controller

increases their own timing domain count. If this counter is less than the trace’s timestamp,

the trace will issue until the counter equal to the timestamp and another request will not be

issued in the same channel. After all trace are finish, system will show those information

and physical execute time to help us evaluate our design.

3.2 Parameter Design

Many specifics described as parameters, can improve program flexibility and usability.

It can simulate other memory architecture by modifying some critical property with low

overhead. There are two classic configurations for our simulator.

First, memory configuration file will provide the timing and current parameter. It also

gives the architecture structure information for DRAM like number of banks and ranks.

Second, system configuration file provides the policies for our target DRAM. These

policies include row buffer policy, memory address mapping, and scheduling method. With

many policies, we have to modularize respective functions to make the program more

readable and easy to expand.

Finally, we use indirect pointer to link the function section in the initial time to save the

(28)

18

this pointer to call specific function directly. It can easily switch to other policies without

recompilation of the program, making it more efficiency.

3.3 Channel Controller

BankState is a table that describe the current bank state including current row buffer

state, current command issued, and requests action corresponding to reasonable cycle time

to fit the timing constrain. This state table will determine when the command can be issued

or not in this cycle and update the table state whenever a command is committed.

Channel controller is the main control of the memory system and operates disjointed

portions of memory stack independently. Command queues store those memory requests

scheduled by the scheduler in the buffers. Commands sequence will be reordered and issued

out of order to improve the memory performance. In our 3DRAMsim simulator, this work

provides the well-know FR-FCFS scheduling policy.

Each controller access a portion of the memory stack. They do not cause interference

and the clocks are maintained by its own domain. The timing clock is store in the controller

object and increases every time controller update called by the upper system. As soon as the

timer increases, next cycle will come and schedulers compare the bank state timing

constraint with the current clock to schedule those commands.

3.3.1 TSV Modeling

All ranks are shared by the same TSV bus in region of the same channel controller. The

(29)

19

TSV is 300 in wide I/O interface for each controller. Controller can continuously send a

command to different ranks every cycle if the command conforms to the timing constrain.

Because wide I/O is burst oriented, all the data accesses are uses burst mode to transfer.

The data bus must be locked a period by data burst transfer with many cycles. In wide I/O

interface, the burst length is either two or four cycles. Also, the stacked memory property

only restricts each timing constraint in the current rank. In other word, the data commands

may be sent continuously. The data bus competed by controllers, so it must have an arbiter

to determine which one can use TSV data bus in this period and when to release the

resource. There is a counter in our wide I/O bus model. It will add burst length to reflect

the bus transfer when the arbiter decides which controller has the next period. Memory

controller will call the lock function to prevent different ranks using the TSV bus when

transaction is not complete. After TSV data bus is locked, the counter will count down to

zero and then unlock the data bus. If the bus is free, the controller will select the earliest

command for scheduling and lock the data bus. It will simulate command continuously and

access data sequentially to transfer. Our TSV bus modeling can more accurately describe

data bus utilization. Simulator can also reflect the bandwidth by calculating the locking

cycle. It is an important metric for memory intensive programs.

In order to save power, it can switch to low power mode if the command queue is

empty and the state is idle for a rank. If there are other commands injected into the

command queue, the rank will switch to normal mode after in a period. It can calculate the

(30)

20

3.4 Accelerating Simulation

In recent development, DRAM has more and more ranks and controllers to improve

access latency. With the simulation components increasing, simulation time also become

much longer because each object has to update its state or transfer data every cycle in the

simulator. Some of them are unnecessary updates only check and increase cycle when the

command queue is empty. There are many potential opportunities to speed up the DRAM

simulator. For example, the previously mentioned indirect pointer saves unnecessary

comparison time when program executes. Also, it can find and improve program parallelism

issues to speed up simulation time. Our speedup mechanism in the 3DRAMsim focuses on

channel controllers because the control, data, and clock signals are critical to independent

control. It means different channel controllers can be executed on their own and the

executions are likely to be parallelized. Furthermore, the result of the speedup version must

be the same as original version to prove the correctness. In order to speed up the channel

controllers, this work has to modify the program execute about the channel update function.

The original version updates channel use loop and update sequentially whenever system

cycle increase and the ordering will not be affected because the controllers are independent.

If the system only has one trace for simulator, it is hard to reflect concurrent execution

for a multi-channel architecture. The trace splitter is described in the previous section. After

the memory system is created, it also creates many threads for each channel controller in

our simulator. Figure 7 show the concept of the thread creation for each channel. In our

multi-thread version, each thread updates their own ranks partially without dependency. The

simulation cycle is the maximum cycles in all the sub-trace returned from the trace splitter.

Because of threads’ difference in speed, it has to wait for all controller threads to finish its

(31)

21

not. The program will insert a barrier to wait for the slower job and print all the information

after all controllers simulate completed. The result in both versions must match to verify the

speedup version’s correctness.

CPU Model

Trace Porcessor Channel Controller Ranks Channel Controller Ranks Channel Controller Ranks Channel Controller Ranks

Figure 7 Pthread Accelerating Method

Figure 8 show our simulator speeds up 3.1 times compare to sequential version in

average. Considering threads diversity, some threads have to wait for the slowest one, the

degree of speedup is always nearly the number of threads if host machine has enough cores

and can support threads to execute independently. The maximum speedup will be

proportional to the number of channels and the upper bound is four times in wide I/O

(32)

22

Figure 8 Simulation Speedup Times

3.5 Profile Model

With the TSV technology coming, there are some differences from traditional DRAM

memory system. Controllers can issue different command to every rank by interleaving. The

rank can issue commands only when sharing the same data bus. The timing constraint in

different ranks only have to consider their layer inside. The property of the parallelism

factor will also be different because the ranks behavior also changes. For example, the rank

to rank switch time will impact the rank level parallelism and the activated row buffer

constraint will impact bank level parallelism. There are more complex factors for 3D IC

development, and the best addressing mapping policy will become harder to find. Our

simulator provides some simple methods that can find a better mapping method to fit your

demand. Better mapping methods can be found by two steps. After mapping analysis, the

simulator will find the trend of different mappings for the current architecture.

First step of evaluation is focused on the rank and channel utilization in the memory

(33)

23

channels mapping are unbalanced, it will cause some resource to idle, causing violent

competition in some rank. This step will eliminate the most of the impossible mapping

methods, but some case is hard to find because the number of commands will continued to

increase in disperse ranks for a long time.

Second, memory level analysis is important to mapping performance. It will run

3DRAMsim with trace by without timestamp. This step will show parallelism because the

command dependency will be roughly taken into consideration. The simulation time is

faster than the complete simulation because it does not have to wait for the trace interval for

command issues. This work chooses those potential methods as candidates and analyze

artificially. The traditional mapping methods consider the bank level parallelism and

column access locality in the simple memory system.

The results show mapping channel’s bits to lowest bits is better because when mapped

to upper bits, the bits are more unbalanced and have less channel level parallelism, since

having independent channel controllers achieves the maximum parallelism. Moreover, the

row bits are mapped to higher bits because this way there is no parallelism issue between

each row. If the row bits are mapped to the lower bits, it may more likely use a different row

(34)

24

IV. Control Mechanism

4.1 Address Mapping Analysis

Semi-automatic method can eliminate those low performance combination mapping. It

shows the row bits mapping to high and channel bits mapping to low get better performance

for wide I/O interface. This thesis analyzes the other parallelism issue to fit the 3D

architecture by run simulation completely. The results of different mapping show in Figure

9 as follow and find the best mapping method. The value is normalized to the first mapping

method.

The x-axis is mapping sequence’s permutation combination of rank, bank, and column.

The first and second methods are mapping column bits to lowest. The third and fourth are

rank lowest, and the others are bank-orient. This work finds the rank will get the best

performance if mapping to the lowest bits. Because of each layer can work independently

with low overhead rank to rank switch penalty and only shared the same TSV bus to

(35)

25

Figure 9 Different Address Mapping Method

(R: Rank / B: Bank / C: Column)

Bank level parallelism will not improve well in our simulation result. It may cause by

opening windows constrain in rank constrain in wide I/O interface. The maximum number

of row buffer be activated is two. It has to close some row buffer to open another, and the

delay penalty will decrease the system performance. This work will select the following

mapping sequence: “Row:Column:Bank:Rank:Channel” to evaluate our following

experiment.

4.2 Adaptive Prefetching Engine

OS manage the all resources allocation in general computer architecture. The memory

(36)

26

address and then convert to physical address by MMU. Memory controller must access data

with physical address. Page is based on the size of block and must be the same in virtual

and physical memory. Page size is normally set range size from 2 to 8KB. The same

application will be allocated to continuous virtual address but it will be mapping to discrete

physical block. Because programs have spatial locality, the nearly address may appear soon.

This behavior is the same in physical page block. This thesis analyzes the behavior in

physical block and prefetch data to improve performance. Our mechanism is much different

from other that this work not focus on core level prediction but physical behavior to find

memory intensive block.

TSV technology can stack different fabrication process and design more flexible. With

three-dimension integration memory layers stacked on top of logic memory control layer.

Memory and CPU can design independently and optimal by themselves. The control can be

aimed at controller and dynamic adaptive with DRAM state. Prefetching is general way to

improve DRAM performance and combine with core. Because the Prefetcher is separated to

DRAM system, it is hard to get the current memory state and only determine by return

requests’ information. In order to get more power information, the LLC must add column

and bypass to CPU. This mechanism will increase cache size and power consumption.

Prefetcher doesn’t always improve system performance due to some reasons. First, miss-predict request will waste the memory bandwidth. Because one channel is shared by

many banks, the useless prefetching request will block the TSV bus and delay demand

requests. The extra fetch data will affect the system behavior for example access other row

buffer. For some critical example, the prefetching mechanism will degrade memory system

performance a lot. Second, extra prefetch request will replace original data in cache. The

(37)

27

our prefetching will combine with controller and depend on memory rank state. This work

also adds an extra cache for our prefetch requests in the 3D memory logic layer.

4.2.1 Prefetching Reference Table

There is address spatial locality property when memory access. Although virtual address

translates to physical with discrete mapping, the same virtual memory block will be

mapping to the same physical block. With the memory intensive access program, the

continuous address would likely access soon with regular address interval. In this case,

memory access pattern will probability to predict the next memory address. On the other

hand, Memory non-intensive programs will hard to predict with memory analysis.

In order to distinguish memory intensive blocks, there is a cache table to record the

memory block utilization. The address is based on the block numbers to record the access

status. The access in the block are using an accuracy counter to reflect the continuously

access in a restrict region. This thesis selects partial bits in physical address as our tag and

index as Figure 10 illustrate. For example, if block size is 4KB with 8GB memory system,

the address has to shift right it mean there are total 2M blocks which size is in

our system. In order to reduce extra cost for record all 2M entry block, this work use a

(38)

28

Figure 10 Prefetching Table

4.2.2 Adaptive-Prefetching Control

In order to reduce useless prefetch requests, the control will not only reference the

prefetching table but collocation with rank state restriction. Our prefetching control flow is

show in Figure 11.

First, this work adaptives the prefetch degree depend on the accuracy in reference.

There is a threshold to restrict the Prefetcher. If the counter higher than threshold, the

different prefetching degree is depend on the accuracy counter. Second, many researches

show the prefetch requests will more likely useless if the request is delay for a long time. In

our design, the prefetch request will insert depend on the command queue entry size. If the

command buffer is already has many requests, prefetch request will be delayed because it

has to process previous request no matter in FCFS property scheduling or demand first

priority based scheduling. In other word, prefetch requests will access data and waiting for a

long time. The advantage of prefetching mechanism will decrease. To prevent this case, our

(39)

29

prefetching request or not.

Prefetch Table Hit T F Update tag & Counter=0 Accuracy Counter > Threshold F T No Prefetch Request Insert Prefetch Request Last_addr -addr < range Accuracy counter--T _F Accuracy

counter--Figure 11 Prefetching Flowchart

4.3 Hide Latency Scheduling

Memory Stacked will improve the system performance and change the memory

property when TSV technology matures. With 3D architecture, there is some opportunity to

optimize the overall system. For example, controller can continuously issue command to

each rank because zero rank to rank switch delay. The rank switch overhead will become

small than before. Continue access different layers data in 3D memory use the same TSV

bus. Memory controllers have to determine which rank can transfer data. Data will delay

(40)

30

controllers lock the TSV bus, it will release many cycles decide by the data burst length. In

other word, next data command will be delay by previous request because the TSV bus

resource is not release yet. In those waiting cycle, other ranks will have opportunity to issue

command and the delay will be overlapped.

FR-FCFS is the most popular scheduling for the DRAM memory system now. It

schedules ready request first and promote the row buffer hit ratio. Most research is based on

this algorithm to improve in case study. The scheduling is not optimized for 3D stacked

memory when the rank to rank switch penalty disappears.

Figure 12 Timing Diagram for Hide Latency Scheduling

Figure 12(a) show the case when continue issue data accessing commands to different

rank. In this case, R0 issue first and R1 after it immediately. The TSV data bus will become

bottleneck and R1 data has to wait for R0 transfer complete. Figure 12(b) is optimized for

this situation. As soon as data access command will be served, there is an opportunity to

issue commands to those not use data bus because the next data will wait for controller

(41)

31

In order to improve the performance, Memory controllers can previous precharge the

row buffer when the bus is busy. This work uses a counter which is call BLCD(Bus Length

Counter Down) to express data utilization and exist in each bank. Whenever controllers

server a request, the BLCD will increase number of burst length. BLCD counter will count

down every cycle until zero.

Our algorithm will get those requests which not use data bus higher priority when the

periods of data transferring delay. To reduce miss predict close page probability, there is a

table to record the last row access status recently. The precharge table provides previous

row access information. When a new row is activated, controller will search precharge table

and load value into row predict counter.

“Row prediction counter” store last row access times. The other counter is called access counter, it will use 4 bit saturating counter and express current row access times. If row

access counter is larger than row predict counter, the row will probability close soon.

Predict close row buffer will bring some advantage. First, command issue when TSV bus is

busy and hide precharge latency as show in Figure 12. Second, predict close row buffer will

save power because the activated row buffers will cause more power consumption than

close buffer. Open page policy will hold data until the precharge command issue. In other

word, more power waste because recharge comes too late.

(42)

32

Looking for the BLCD to determine whether issue precharge or not. If the BLCD is not

equal to zero, the row predict counter will compare to access counter. When the row predict

counter is bigger the precharge will issue to rank, or it will follow original schedule to issue

command. The control flow is described in Figure 14.

F

T

Next Cycle

&

BLCD

--BLCD += x

Request

Ready

Prioritize

Oldest

T

F

PRECHARGE

Row Buffer

T

F

_FR-FCFS

Scheduling

Predict PRE

BLCD == 0

(43)

33

V. Experimental Results

5.1 Environment Setup

5.1.1 System Configuration

Table 4 System Configuration

This work uses GEM5 which is a full system simulator to get the memory traces. Our

(44)

34

target simulator environment. The memory trace information include timestamp and

physical memory address are generated by running benchmark on GEM5. This thesis

evaluate the DRAM system with parsec[20] and spec2006[21] benchmark which supplies

parallel program and memory intensive workload. The detail benchmark is show detail in

table 6. Application blackscholes in_64K bodytrack sequenceB_4 4 4 4000 5 0 8 facesim timing freqmine kosarak_990k.dat 790 x264 eledream_640x360_128.y4m 401.bzip2 input.program 5 429.mcf inp.in 444.namd --iterations 1 450.soplex m10000 458.sjeng 464.h264ref foreman_encoder_baseline.cfg 470.lbm reference.dat 473.astar lake.cfg 483.xalancbmk xalanc.xsl spec2006

parsec 8 channel / Large

Parameter

test

Table 5 Overview of Benchmark

The memory configuration is described our memory system timgin constrain parameter.

(45)

35

Name

Description

Value

tCK

Clock

5ns

RL

CAS latency

15ns

tRCD

Row to column Delay

18ns

tRP

Row recharge delay

18ns

tRAS

Row active time

42ns

tRFC

Refresh time

210ns

tTAW

Two bank active widwow

50ns

tREFI

Refresh time interval

3.9 us

tWTR

Write to read delay

15ns

tRRD

Row to row active delat

10ns

tXP

Power down

10ns

BL

Burst length

4

Table 6 Memory Configuration

5.1.2 Power Model

Wide I/O interface is a new technology and not release physical specific with product

yet. This work selects LPDDR2 as reference target because its property is close to Wide I/O.

Our power parameter reference Micron LPDDR2 to analysis new design in trend. The

power formula is voltage multiple current. There are two voltages support in LPDDR2:

VDD1 is set to 1.8V and VDD2 is set to 1.2V. The current multiple correspond voltage and

accumulate when following status occur. If controller issue some commands the action

power will be accumulated. Otherwise, background calculates DRAM idle power and

different from DRAM power mode and row buffer status. Table 8 shows LPDDR2 power

(46)

36

Table 7 LPDDR2 Power Parameter

5.2 Precharge Prediction Analysis

Because some compoments of our control mechanism is an extra overhead, this thesis

takes not only performance but also hardware cost overhead into consider. The extra

precharge mechanism in our design using a precharge prediction table to record the row

(47)

37

Figure 15 Precharge Prediction Table Size

The x-axis is the number of table entries and the y-axis is improvement percentage by

original. The average line is increasing rapidly before the table entries is 16. It grows slowly

and saturation when the number is greater than 16 because the buffer will not change row

frequently in a period. It may be caused by RLP-oriented address mapping and FR-FCFS

(48)

38

Figure 16 Latency Improvement by different Scheduling

(49)

39

Figure 16 show the latency in our Latency overlapping scheduling by precharge

prediction and compare to the MLP scheduling. MLP scheduling separate read/write in

different command queues and issue write command only when the number of write exceed

threshold. Our method is less than MLP scheduling by improvement latency because MLP

save not only read to write timing delay but prioritizes the read command. Figure 17 show

our method can help to save power because pre-close row buffer can use lower current and

more power efficiency. Because our scheduling is control those precharge command in bus

busy slot, it’s independently with MLP scheduling. The result shows our method can

combine MLP scheduling without conflict.

5.3 Adaptive-Prefetch Analysis

(50)

40

Figure 18 shows our analysis about prefetch cache size. The x-axis is the number of

table entries and the y-axis is memory improvement by original version. Because each entry

size is up to 64B to store full cache block data. The prefetch cache size will increase rapidly

if the cache extend. In our analysis, when table is larger than 128 entries, the performance

will improve slowly but double the cache size because the prefetch requests not

replacement so frequently and the size is enough. This work selects 128 as prefetch cache

entries.

Figure 19 Improvement Relative to No Threshold Prefetching

In order to reduce useless prefetch request and improve rank level parallelism, this

work proposes constrain the number of prefetch requests when the command buffer exist

requests more than a threshold. This method not only helps to reduce useless prefetch

requests but also rank balance. Figure 19 indicates threshold is set to quarter of command

(51)

41

Figure 20 Prefetch Cache Hit Rate

Figure 20 show the hit ratio is nearly 50% accuracy in our adaptive-prefetching

technique cache.

(52)

42

Figure 22 DRAM Power Overhead by Adaptive-Prefetching Technology

Figure 21 show our adaptive-prefetching mechanism performance improvement of

DRAM latency. It will improve nearly 11.7% in average. Our control will limit prefetch

requests so the performance is also limited. Figure 11 show the extra energy overhead.

Although prefetch mechanism can reduce latency, useless data will cause extra power

(53)

43

5.4 Hardware Cost

Table 8 Precharge Predict Overhead

Table 8 show the hardware overhead of latency overlapping scheduling implement.

Because the precharge predition table is exist in each bank. The table size has to multiple to

number of bank.

Table 9 Adaptive-Prefetch Overhead

Table 9 show the cost for Prefetching mechanism and cache. Because the prefetcher

and cache exist in each channel controller. The table size has to multiple to number of

(54)

44

VI. Conclusion

This thesis provides an ESL platform with TSV architecture that can support wide I/O

interface. Our simulator is modularize with configurable design parameters so it’s ease to

model different architecture. This work also speedup the simulation time by using pthread

acceleration method for each channel. The speedup is approximate 3.1 times with four

channel controller design relative to sequential version.

With analysis of different memory mapping methods in wide I/O, results show that rank

level parallelsim is good for 3D design because the rank to rank switch penalty decreases

and the maximum number of activated windows is constrained.

Finally, this work proposes two mechanisms to improve system performance. Adaptive

prefetching techniques analyzes memory intensiveness and access locality to prefetch data

and considers the number of rank command to improve latency nearly by 13% with 3.5%

power overhead. Latency overlapping schedules ahead of precharge command by

overlapping TSV data transfer delay. Latency overlapping scheduling helps to reduce 0.6%

latency and save 4% power consumption by pre closing row buffers.

The hardware cost is an extra overhead in design. This work tradeoff overhead to get

(55)

45

References

[1] C. Weis, N. Wehn, L. Igor, and L. Benini, “Design space exploration for

3D-stacked DRAMs,” in Design, Automation Test in Europe Conference Exhibition (DATE),

2011, pp. 1 –6 (2011).

[2] C.-H. Lin, W.-T. Hsieh, H.-C. Hsieh, C.-N. Liu, and J.-C. Yeh, “System-level

design exploration for 3-D stacked memory architectures,” in 2011 Proceedings of the 9th

International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), p. 389 (2011).

[3] J.-S. Kim, C. S. Oh, H. Lee, D. Lee, H. R. Hwang, S. Hwang, B. Na, J. Moon,

J.-G. Kim, et al., “A 1.2 V 12.8 GB/s 2 Gb Mobile Wide-I/O DRAM With 4 * 128 I/Os

Using TSV Based Stacking,” IEEE Journal of Solid-State Circuits 47, 107 –116 (2012).

[4] G. H. Loh, “3D-Stacked Memory Architectures for Multi-core Processors,”

SIGARCH Comput. Archit. News 36, 453–464 (2008).

[5] K.-M. Choi, “An industrial perspective of 3D IC integration technology from the

viewpoint of design technology,” in Design Automation Conference (ASP-DAC), 2010 15th

Asia and South Pacific, pp. 544 –547 (2010).

[6] I. Loi and L. Benini, “An efficient distributed memory interface for many-core

platform with 3D stacked DRAM,” in Proceedings of the Conference on Design,

Automation and Test in Europe, pp. 99–104, European Design and Automation Association,

3001 Leuven, Belgium, Belgium (2010).

(56)

46

access scheduling,” 2000, New York, NY, USA, 128–138, ACM.

[8] D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist open-page: a DRAM

page-mode scheduling policy for the many-core era,” 2011, New York, NY, USA, 24–35,

ACM.

[9] J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John, “The virtual

write queue: coordinating DRAM and last-level cache policies,” 2010, New York, NY, USA,

72–82, ACM.

[10] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A scalable and

high-performance scheduling algorithm for multiple memory controllers,” January 2010,

1 –12.

[11] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing

both Performance and Fairness of Shared DRAM Systems,” 2008, Washington, DC, USA, 63–74, IEEE Computer Society.

[12] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, “Feedback Directed Prefetching:

Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” February

2007, 63 –74.

[13] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt, “Coordinated control of multiple

prefetchers in multi-core systems,” December 2009, 316 –326.

[14] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving memory bank-level

parallelism in the presence of prefetching,” 2009, New York, NY, USA, 327–336, ACM.

[15] D. Joseph and D. Grunwald, “Prefetching using Markov predictors,” in

Proceedings of the 24th annual international symposium on Computer architecture, pp.

252–263, ACM, New York, NY, USA (1997).

(57)

47

challenges,” in 11th International Symposium on High-Performance Computer Architecture,

2005. HPCA-11, pp. 248 – 252 (2005).

[17] I. Hur and C. Lin, “Adaptive History-Based Memory Schedulers,” in Proceedings

of the 37th annual IEEE/ACM International Symposium on Microarchitecture, pp. 343–354,

IEEE Computer Society, Washington, DC, USA (2004).

[18] C. Zhang and S. A. McKee, “Hardware-only stream prefetching and dynamic

access ordering,” in Proceedings of the 14th international conference on Supercomputing,

pp. 167–175, ACM, New York, NY, USA (2000).

[19] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A Cycle Accurate

Memory System Simulator,” Computer Architecture Letters 10, 16 –19 (2011).

[20] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite:

characterization and architectural implications,” in Proceedings of the 17th international

conference on Parallel architectures and compilation techniques, pp. 72–81, ACM, New

York, NY, USA (2008).

[21] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” SIGARCH Comput.

調適提前讀取技術與重疊延遲排程之三維寬頻記憶體

國

立

交

通

大

學

資訊科學與工程研究所

碩

士

論

文

調適提前讀取技術與重疊延遲排程之

三維寬頻記憶體

Adaptive Prefetching Techniques and Latency Overlapping

Scheduling for 3D Wide I/O Memory

研 究 生：郭泰均

指導教授：陳添福 教授

調適提前讀取技術與重疊延遲排程之三維寬頻記憶體

Adaptive Prefetching Techniques and Latency Overlapping Scheduling for 3D

Wide I/O Memory

研 究 生：郭泰均 Student：Tai-Chun Guo

指導教授：陳添福博士 Advisor：Dr. Tien-Fu Chen

國 立 交 通 大 學

資 訊 科 學 與 工 程 研 究 所

碩 士 論 文

調適提前讀取技術與重疊延遲排程之三

維寬頻記憶體

研究生：郭泰均

指導教授：陳添福博士

國 立 交 通 大 學 資 訊 學 院 資 訊 學 程 碩 士 班

摘

要

Adaptive Prefetching Techniques and Latency Overlapping

Scheduling for 3D Wide I/O Memory

Student: Tai-Chun Guo

Advisor: Dr. Tien-Fu Chen

Degree Program of Computer Science

National Chiao Tung University

ABSTRACT

誌

謝

郭泰均

Table of Contents

List of Tables

List of Figures

I. Introduction

1.1 Motivation and Introduction

1.2 Contribution

1.3 Report Organization

II. Background and Related Work

2.1.1 Row Buffer

2.1.2 Row Buffer Page Policy

2.1.3 Memory System

2.1.4 TSV Technology

2.2 Memory Controller Scheduling

2.2.1 FR-FCFS Schedule

2.2.2 Fairness Scheduling

2.2.3 Scheduling Summary

2.3 Prefetch Policy

2.3.1 Comparison of Prefetch Mechanism

III. 3DRAMSim Framework

3.1 Overview

3.2 Parameter Design

3.3 Channel Controller

3.3.1 TSV Modeling

3.4 Accelerating Simulation

CPU Model

3.5 Profile Model

IV. Control Mechanism

4.1 Address Mapping Analysis

(R: Rank / B: Bank / C: Column)

4.2 Adaptive Prefetching Engine

4.2.1 Prefetching Reference Table

4.2.2 Adaptive-Prefetching Control

4.3 Hide Latency Scheduling

F

T

Next Cycle

&

研究生：郭泰均

指導教授：陳添福教授

研究生：郭泰均 Student：Tai-Chun Guo

國立交通大學

資訊科學與工程研究所

碩士論文

國立交通大學資訊學院資訊學程碩士班

_FR-FCFS