3DRAMSim Framework - 3D Wide-Bandwidth Memory with Adaptive Early-Read Techniques and Overlapped-Latency Scheduling

3.1 Overview

This section describes the features of our 3DRAMSim simulator. 3DRAMSim is a cycle-accurate DRAM simulator based on DRAMSim2 [19]. It provides a 3D memory environment for analyzing designs at an early stage. It can also simulate traditional DRAM, such as SDRAM or DDR memory, following the JEDEC protocols.

The simulator can model different types of memory by modifying the system configuration files and timing parameters. All DRAM properties are parameterized and described in the memory configuration file. With TSV technology, the system can control different layers independently while they share the same TSV data bus. The simulator can also model multi-channel architectures and concurrent DRAM operation.

Figure 6 Program Execution Flow

Figure 6 illustrates the 3DRAMSim execution flow. The simulator consists of three parts: the trace splitter, the memory controller, and the memory storage. The trace splitting stage includes the overall system timing control and an address mapping analysis tool. To simulate a multi-channel environment, the trace must be pre-processed: it is split into sub-traces, and each sub-trace is issued by its own channel controller. The trace splitter must therefore know the address mapping method so that it can place each request in the sub-trace of the channel it maps to. After trace splitting, each controller increments the cycle counter of its own timing domain. If the counter is less than the timestamp of the next request, the request is held until the counter equals the timestamp, and no other request is issued on that channel in the meantime. After all traces have finished, the system reports the collected statistics and the physical execution time to help us evaluate the design.
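The splitting and replay steps can be sketched as follows. This is a minimal illustration, not the 3DRAMSim code: the record layout `(timestamp, op, address)`, the bit widths, the placement of the channel bits just above the burst offset, and the one-request-per-cycle issue rule are all assumptions made for the example.

```python
from collections import defaultdict

OFFSET_BITS = 6   # hypothetical: 64-byte burst offset
CHANNEL_BITS = 2  # hypothetical: four channels

def channel_of(address):
    """Return the channel index under the assumed mapping (channel bits just above the offset)."""
    return (address >> OFFSET_BITS) & ((1 << CHANNEL_BITS) - 1)

def split_trace(trace):
    """Split (timestamp, op, address) records into one sub-trace per channel, preserving order."""
    sub_traces = defaultdict(list)
    for record in trace:
        sub_traces[channel_of(record[2])].append(record)
    return dict(sub_traces)

def replay(sub_trace):
    """Hold each request until the channel's cycle counter reaches its timestamp.

    Assumption: a channel issues at most one request per cycle.
    """
    cycle = 0
    issued = []
    for timestamp, op, address in sub_trace:
        cycle = max(cycle, timestamp)  # wait until the counter equals the timestamp
        issued.append((cycle, op, address))
        cycle += 1
    return issued

trace = [(0, "R", 0x000), (1, "W", 0x040), (1, "R", 0x080), (5, "R", 0x000)]
sub_traces = split_trace(trace)
```

Each channel replays only its own sub-trace, so the per-channel counters advance independently of one another.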

3.2 Parameter Design

Describing many specifics as parameters improves the flexibility and usability of the program. The simulator can model other memory architectures by modifying a few critical properties with low overhead. It uses two kinds of configuration files.

First, the memory configuration file provides the timing and current parameters. It also gives the architectural structure of the DRAM, such as the number of banks and ranks.

Second, the system configuration file provides the policies for the target DRAM. These policies include the row buffer policy, the memory address mapping, and the scheduling method. Because there are many policies, we modularize the respective functions to make the program more readable and easier to extend.

Finally, we use an indirect (function) pointer that is bound to the selected function at initialization time, which saves the comparison time in each simulation cycle. Once the pointer has been set up, we use it to call the specific function directly. The simulator can thus switch to other policies without recompiling the program, which makes it more efficient.
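The idea of binding a policy once at initialization can be sketched as below. The policy names, the configuration key `row_buffer_policy`, and the `Controller` class are hypothetical and stand in for whatever the real configuration files define; the original simulator uses function pointers, for which a Python callable is the closest analogue.

```python
def open_page(bank):
    """Open-page policy: leave the row open after a column access (no precharge)."""
    return False

def close_page(bank):
    """Close-page policy: precharge the bank right after a column access."""
    return True

# Lookup table consulted once, when the configuration is read.
ROW_BUFFER_POLICIES = {"open_page": open_page, "close_page": close_page}

class Controller:
    def __init__(self, config):
        # The string comparison happens here only; the bound callable plays
        # the role of the indirect pointer described in the text.
        self.should_precharge = ROW_BUFFER_POLICIES[config["row_buffer_policy"]]

    def after_column_access(self, bank):
        # Called every cycle without re-checking which policy is selected.
        return self.should_precharge(bank)

controller = Controller({"row_buffer_policy": "close_page"})
```

Changing the policy then only requires editing the configuration value, not the controller code.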

3.3 Channel Controller

BankState is a table that describes the current state of each bank, including the current row buffer state, the last command issued, and the earliest cycle at which each subsequent action is allowed under the timing constraints. This table determines whether a command can be issued in the current cycle, and it is updated whenever a command is committed.
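A bank state entry of this kind might look like the sketch below. The field names, the three commands, and the timing values are illustrative assumptions and not the parameters of any particular device; only the activate-to-column (tRCD), activate-to-precharge (tRAS), and precharge-to-activate (tRP) constraints are modeled.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder timing values in cycles (assumptions, not datasheet numbers).
T_RCD, T_RAS, T_RP = 3, 7, 3

@dataclass
class BankState:
    open_row: Optional[int] = None  # current row buffer contents
    next_activate: int = 0          # earliest cycle an ACT may issue
    next_column: int = 0            # earliest cycle a RD/WR may issue
    next_precharge: int = 0         # earliest cycle a PRE may issue

    def can_issue(self, command, row, now):
        """Check the command against the row buffer state and timing table."""
        if command == "ACT":
            return self.open_row is None and now >= self.next_activate
        if command in ("RD", "WR"):
            return self.open_row == row and now >= self.next_column
        if command == "PRE":
            return self.open_row is not None and now >= self.next_precharge
        return False

    def commit(self, command, row, now):
        """Update the table after a command has been issued."""
        if command == "ACT":
            self.open_row = row
            self.next_column = now + T_RCD
            self.next_precharge = now + T_RAS
        elif command == "PRE":
            self.open_row = None
            self.next_activate = now + T_RP

bank = BankState()
bank.commit("ACT", 5, now=0)
```

The scheduler would call `can_issue` with the controller's current clock each cycle and call `commit` only for the command it actually issues.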

The channel controller is the main control unit of the memory system, and each one operates a disjoint portion of the memory stack independently. Command queues buffer the memory requests scheduled by the scheduler. The command sequence is reordered and issued out of order to improve memory performance. 3DRAMSim provides the well-known FR-FCFS scheduling policy.
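FR-FCFS (first-ready, first-come-first-served) prefers requests that hit an open row and breaks ties by age. The selection rule can be sketched as follows; the request tuple `(arrival, bank, row)` and the `open_rows` dictionary are assumptions of this example, and timing checks are left out for brevity.

```python
def fr_fcfs_pick(queue, open_rows):
    """Pick the next request from a queue kept in arrival order.

    queue:     list of (arrival, bank, row) tuples, oldest first
    open_rows: dict mapping bank -> currently open row
    """
    # First ready: the oldest request whose row is already open in its bank.
    for request in queue:
        _, bank, row = request
        if open_rows.get(bank) == row:
            return request
    # Otherwise first come, first served: the oldest request overall.
    return queue[0] if queue else None

queue = [(0, 0, 7), (1, 1, 3), (2, 0, 4)]
picked = fr_fcfs_pick(queue, open_rows={0: 4, 1: 9})
```

Here the youngest request is chosen because it is the only row hit, which is the out-of-order behaviour described in the text.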

Each controller accesses its own portion of the memory stack. The controllers do not interfere with one another, and each maintains the clock of its own domain. The clock is stored in the controller object and increases every time the controller's update function is called by the upper-level system. Each increment starts the next cycle, in which the scheduler compares the bank state timing constraints with the current clock to schedule commands.

3.3.1 TSV Modeling

All ranks in the region of one channel controller share the same TSV bus. The signals fall into two kinds: control signals and the data bus. In the wide I/O interface, the total number of TSVs is 300 for each controller. The controller can send a command to a different rank every cycle as long as the command conforms to the timing constraints.

Because wide I/O is burst oriented, all data accesses are transferred in burst mode.

The data bus must be locked for the duration of a data burst, which takes several cycles. In the wide I/O interface, the burst length is either two or four cycles. In addition, in stacked memory each timing constraint applies only within the current rank. In other words, data commands to different ranks may be sent back to back. Because the data bus is contended, an arbiter must decide which controller may use the TSV data bus in a given period and when the resource is released. Our wide I/O bus model contains a counter. When the arbiter decides which controller owns the next period, the burst length is added to the counter to reflect the bus transfer. The memory controller calls the lock function to prevent other ranks from using the TSV bus while a transaction is incomplete. After the TSV data bus is locked, the counter counts down to zero and the data bus is then unlocked. If the bus is free, the controller selects the earliest command for scheduling and locks the data bus. In this way, commands are simulated back to back while their data transfers are serialized on the bus. Our TSV bus model therefore describes data bus utilization more accurately. The simulator can also report the bandwidth by counting the locked cycles, which is an important metric for memory-intensive programs.
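The lock-and-count-down behaviour of the bus model can be sketched as below. The class and function names, the request tuple `(arrival, requester, burst_length)`, and the eight-cycle run are hypothetical choices for this example only.

```python
class TsvDataBus:
    """Shared data bus with a burst counter, in the spirit of the model described above."""

    def __init__(self):
        self.busy_cycles_left = 0  # the counter: cycles until the bus unlocks
        self.owner = None
        self.locked_cycles = 0     # accumulated for utilization and bandwidth

    def is_free(self):
        return self.busy_cycles_left == 0

    def lock(self, owner, burst_length):
        """Lock the bus for one burst; refuse if a transfer is in progress."""
        if not self.is_free():
            return False
        self.owner = owner
        self.busy_cycles_left += burst_length
        return True

    def tick(self):
        """Advance one cycle: count down and unlock when the burst completes."""
        if self.busy_cycles_left > 0:
            self.busy_cycles_left -= 1
            self.locked_cycles += 1
            if self.busy_cycles_left == 0:
                self.owner = None

def arbitrate(bus, pending):
    """Grant the free bus to the earliest pending (arrival, requester, burst_length)."""
    if bus.is_free() and pending:
        request = min(pending)
        pending.remove(request)
        bus.lock(request[1], request[2])
        return request
    return None

bus = TsvDataBus()
pending = [(1, "rank1", 2), (0, "rank0", 4)]
grants = []
for cycle in range(8):
    granted = arbitrate(bus, pending)
    if granted:
        grants.append((cycle, granted[1]))
    bus.tick()
utilization = bus.locked_cycles / 8
```

In this run the second requester has to wait until the four-cycle burst of the first one has drained, and the ratio of locked cycles to total cycles gives the bus utilization.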

To save power, a rank can switch to low power mode if its command queue is empty and its state is idle. If other commands are then injected into the command queue, the rank switches back to normal mode after a wake-up period. The simulator counts the active time in powered-up cycles to reflect the utilization of each rank.
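One way to picture the mode switching is a small per-rank state machine. The three mode names, the two-cycle wake-up latency, and the rule that only busy cycles in active mode count as active time are assumptions of this sketch rather than details given in the text.

```python
class RankPower:
    """Per-rank power mode tracker (illustrative only)."""

    def __init__(self, wakeup_latency=2):
        self.mode = "active"            # "active", "low_power", or "waking"
        self.wakeup_latency = wakeup_latency
        self.wake_left = 0
        self.active_cycles = 0
        self.total_cycles = 0

    def tick(self, queue_empty, banks_idle):
        self.total_cycles += 1
        if self.mode == "active":
            if queue_empty and banks_idle:
                self.mode = "low_power"  # nothing to do: power down
            else:
                self.active_cycles += 1
        elif self.mode == "low_power":
            if not queue_empty:
                self.mode = "waking"     # a command arrived: start waking up
                self.wake_left = self.wakeup_latency
        else:  # waking
            self.wake_left -= 1
            if self.wake_left == 0:
                self.mode = "active"

rank = RankPower()
inputs = [(False, False), (True, True), (True, True), (False, True),
          (False, True), (False, True), (False, False)]
for queue_empty, banks_idle in inputs:
    rank.tick(queue_empty, banks_idle)
```

Dividing `active_cycles` by `total_cycles` then gives a per-rank utilization figure.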

3.4 Accelerating Simulation

In recent developments, DRAM has gained more and more ranks and controllers to improve access latency. As the number of simulated components grows, simulation time also becomes much longer, because every object has to update its state or transfer data in every cycle. Some of these updates are unnecessary: when the command queue is empty, they only perform checks and increment the cycle count. There are many opportunities to speed up the DRAM simulator. For example, the indirect pointer mentioned in Section 3.2 saves unnecessary comparisons while the program executes. Exploiting parallelism in the program can also reduce simulation time. Our speedup mechanism in 3DRAMSim focuses on the channel controllers, because each controller has its own control, data, and clock signals and is therefore controlled independently. This means that different channel controllers can execute on their own and that their execution can be parallelized. Furthermore, the results of the accelerated version must be the same as those of the original version to prove its correctness. To speed up the channel controllers, this work modifies how the program executes the channel update function.

The original version updates the channels sequentially in a loop whenever the system cycle increases. Because the controllers are independent, the order of the updates does not affect the result.

If the simulator has only one trace for the whole system, it is hard to reflect concurrent execution in a multi-channel architecture, which is why the trace splitter described in Section 3.1 is used. After the memory system is created, the simulator also creates a thread for each channel controller. Figure 7 shows the concept of thread creation for each channel. In the multi-threaded version, each thread updates its own ranks without dependencies on other threads. The simulation length is the maximum cycle count over all the sub-traces returned by the trace splitter.

Because the threads run at different speeds, the simulator has to wait for all controller threads to finish, so there is a synchronization point that checks whether the other controllers have completed. The program inserts a barrier to wait for the slowest job and prints all the information after every controller has completed its simulation. The results of the two versions must match to verify the correctness of the accelerated version.

Figure 7 Pthread Accelerating Method
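The structure of the threaded version, one thread per channel followed by a synchronization point and a comparison against the sequential result, can be sketched as follows. The original simulator uses Pthreads; this sketch uses Python's `threading` module only to show the control flow, and `simulate_channel` is a stand-in that merely advances a cycle counter over request timestamps.

```python
import threading

def simulate_channel(sub_trace):
    """Stand-in for one channel controller: return the cycle count after its sub-trace."""
    cycle = 0
    for timestamp in sub_trace:
        cycle = max(cycle, timestamp) + 1
    return cycle

def run_sequential(sub_traces):
    """Original scheme: update the channels one after another."""
    return [simulate_channel(sub_trace) for sub_trace in sub_traces]

def run_threaded(sub_traces):
    """Accelerated scheme: one thread per channel, each writing only its own slot."""
    results = [None] * len(sub_traces)

    def worker(index):
        results[index] = simulate_channel(sub_traces[index])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(sub_traces))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # synchronization point: wait for the slowest channel
    return results

sub_traces = [[0, 1, 5], [2, 2], [], [10]]
sequential = run_sequential(sub_traces)
threaded = run_threaded(sub_traces)
simulation_length = max(threaded)
```

Because CPython's global interpreter lock serializes pure-Python computation, this sketch demonstrates only the structure and the correctness check, not the speedup itself.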

Figure 8 shows that our simulator runs 3.1 times faster than the sequential version on average. Because threads progress at different speeds and some must wait for the slowest one, the speedup is close to, but below, the number of threads, provided the host machine has enough cores to run the threads independently. The maximum speedup is proportional to the number of channels, so the upper bound is four times for the wide I/O interface.

Figure 8 Simulation Speedup Times
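Why the speedup stays below the channel count can be seen from an idealized model in which the sequential time is the sum of the per-channel times and the parallel time is that of the slowest channel. The model ignores thread creation and synchronization overhead, and the per-channel times below are invented for illustration; they are not the measurements behind Figure 8.

```python
def ideal_speedup(channel_times):
    """Sequential time (sum) divided by parallel time (slowest channel)."""
    return sum(channel_times) / max(channel_times)

balanced = ideal_speedup([5, 5, 5, 5])     # all four channels equally loaded
unbalanced = ideal_speedup([10, 8, 8, 6])  # one channel slower than the rest
```

With four channels the ratio reaches four only when every channel takes the same time; any imbalance lowers it.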

3.5 Profile Model

With the arrival of TSV technology, the memory system differs in several ways from a traditional DRAM system. Controllers can issue different commands to every rank by interleaving. Ranks interact only through the shared data bus, and the timing constraints of each rank need to be considered only within its own layer. The parallelism factors also change because the behavior of the ranks changes. For example, the rank-to-rank switch time affects rank-level parallelism, and the activated row buffer constraint affects bank-level parallelism. 3D IC development involves more complex factors, and the best address mapping policy becomes harder to find. Our simulator provides some simple methods for finding a mapping that better fits a given demand. A better mapping method is found in two steps. After this mapping analysis, the simulator shows the trend of the different mappings for the current architecture.

The first step of the evaluation focuses on rank and channel utilization in the memory system. Delay becomes longer when the parallelism factors decrease. If the mapping of ranks or channels is unbalanced, some resources sit idle while requests compete heavily for a few ranks. This step eliminates most of the unsuitable mapping methods, but some cases are hard to detect because the number of commands in dispersed ranks can keep increasing for a long time.
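A balance check of this kind can be sketched by counting how many requests each channel (or rank) receives under a candidate mapping. The imbalance metric (busiest resource divided by the mean load), the bit positions, and the synthetic address stream are assumptions of this example, not the metric used by the simulator.

```python
from collections import Counter

def field(address, low_bit, width):
    """Extract a bit field, e.g. the channel or rank index, from an address."""
    return (address >> low_bit) & ((1 << width) - 1)

def imbalance(addresses, low_bit, width):
    """Load of the busiest resource divided by the mean load (1.0 is perfectly balanced)."""
    counts = Counter(field(address, low_bit, width) for address in addresses)
    mean = len(addresses) / (1 << width)
    return max(counts.values()) / mean

# Synthetic stream: 64 consecutive 64-byte blocks.
addresses = [i * 64 for i in range(64)]
low_mapping = imbalance(addresses, low_bit=6, width=2)    # channel bits just above the offset
high_mapping = imbalance(addresses, low_bit=20, width=2)  # channel bits in upper address bits
```

For this sequential stream, the low-bit mapping spreads the requests evenly, while the high-bit mapping sends all of them to one channel and leaves the other three idle.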

Second, memory-level analysis is important to mapping performance. This step runs 3DRAMSim on the trace without timestamps. It exposes parallelism because command dependencies are roughly taken into account. The simulation is faster than the complete simulation because it does not have to wait out the intervals between requests in the trace. This work chooses the promising methods as candidates and analyzes them manually. Traditional mapping methods consider bank-level parallelism and column access locality in a simple memory system.

The results show that mapping the channel bits to the lowest address bits is better. Independent channel controllers offer the maximum parallelism, and when the channel bits are mapped to upper bits, requests are distributed less evenly and channel-level parallelism is reduced. Moreover, the row bits are mapped to the higher bits because there is no parallelism to exploit between rows. If the row bits were mapped to the lower bits, consecutive accesses would be more likely to use different rows and incur more row buffer switch penalties.
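A mapping with the channel bits lowest and the row bits highest can be sketched as a simple field decoder. Only those two placements come from the text; the order of the column, bank, and rank fields in between, all field widths, and the 64-byte offset are assumptions of this example.

```python
OFFSET_BITS = 6  # hypothetical burst offset, dropped before decoding

# Fields from least to most significant bit; widths are placeholders.
FIELDS = [("channel", 2), ("column", 7), ("bank", 2), ("rank", 2), ("row", 12)]

def decode(address, fields=FIELDS, offset_bits=OFFSET_BITS):
    """Split a physical address into named fields, lowest field first."""
    value = address >> offset_bits
    decoded = {}
    for name, width in fields:
        decoded[name] = value & ((1 << width) - 1)
        value >>= width
    return decoded

# Four consecutive 64-byte blocks.
blocks = [decode(i * 64) for i in range(4)]
channels = [block["channel"] for block in blocks]
rows = {block["row"] for block in blocks}
```

Consecutive blocks land on different channels while staying in the same row, which is the combination of channel-level parallelism and row buffer locality that the results favor.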
