Memory-Centric On-Chip Data Communication Platform for Wireless Video

Chapter 3 Memory-Centric On-chip Data Communication Platform

3.4 Memory-Centric On-Chip Data Communication Platform for Wireless Video

Designers aim to deliver efficient processing capability, merge multiple tasks into one system, and follow green-computing principles. Integrating heterogeneous functional blocks into a single system, however, requires multiprocessing techniques and multimedia processing units. Furthermore, as the resolution of video applications increases, video signal processors must handle a large amount of data within a tightly bounded time. Because of these heavy data accesses, system performance depends strongly on the memory bandwidth between the processors and the external memories. The system has real-time, high-volume memory access requirements, yet the speed gap between memory and processor units in an SoC is large. Much research has tried to narrow this gap. Well-organized memory management can significantly reduce memory access latency. Based on the data characteristics of these applications, designers can find a suitable memory allocation method that reduces the number of memory accesses and the average access latency.

Accordingly, for wireless video entertainment systems, a memory-centric on-chip data communication platform is applied to provide high bandwidth and satisfy the memory requirements.

In the receiver system described in Section 3.3, the multiple tasks are generally processed step by step. The data stream of wireless video entertainment systems is shown in Fig. 3.15. In the memory-centric on-chip data communication platform, the on-demand memory system supports the heterogeneous, real-time memory requirements of wireless video entertainment systems. The MMUs in the on-demand memory system give the processor elements adaptive memory resources. Based on the different memory requirements of these processor elements, the centralized MMU can dynamically allocate memory resources to them.

With suitable memory resource arrangement for different processor elements, the execution efficiency of the streaming processing in wireless video entertainment systems can be improved.

[Figure: Task 0 (WPU, MAC), Task 1 (LT coding), Task 2 (SVC) and Task 3 (video frames), with input data passing through the on-demand memory system]

Fig. 3.15 Data stream of wireless video entertainment systems

The overall architecture of the system is shown in Fig. 3.16. The system components can be categorized into data computation, data communication and data storage. Data computation includes the WPU, MAC, LT coding and SVC processor elements. Wrappers are applied so that these elements satisfy the pre-defined protocol. The other components are introduced below.

Data communication includes the network interface (NI) and the interconnection network. This system uses a message-passing mechanism: the NI packs the data to be transmitted into packets, which travel through the interconnection network under a pre-defined message-passing protocol. The NI packetizes the outgoing data with a header indicating the data source, the destination and some data information, and then transmits the packet to the other node. It also de-packetizes the data received from other processor elements. In addition, the NI includes a packet queue to store blocked packets.
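The packetize and de-packetize steps can be sketched as follows. This is a minimal illustration only: the `Packet` fields, the 8-word payload limit and the function names are assumptions made for the example, not the header format defined by the platform.

```python
from dataclasses import dataclass
from typing import List

# Assumption for this sketch: one packet carries at most 8 payload words.
MAX_PAYLOAD_WORDS = 8

@dataclass
class Packet:
    src: int            # source node ID (header field)
    dst: int            # destination node ID (header field)
    length: int         # number of payload words (stands in for "data information")
    payload: List[int]  # payload words

def packetize(src: int, dst: int, words: List[int]) -> List[Packet]:
    """Split a word stream into packets, each carrying a small header."""
    packets = []
    for i in range(0, len(words), MAX_PAYLOAD_WORDS):
        chunk = list(words[i:i + MAX_PAYLOAD_WORDS])
        packets.append(Packet(src, dst, len(chunk), chunk))
    return packets

def depacketize(packets: List[Packet]) -> List[int]:
    """Strip the headers and concatenate the payloads in arrival order."""
    words = []
    for p in packets:
        words.extend(p.payload[:p.length])
    return words
```

For instance, ten words sent from node 0 to node 3 would be split into one 8-word packet and one 2-word packet under these assumptions.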

For data storage, each distributed processor element owns a d-MMU, which includes a distributed cache (L1 cache) and a cache controller for memory access. The d-MMU also manages cache usage. When the packet queue in the NI is too small, the d-MMU can lend unused cache blocks to the NI. In addition, a c-MMU is constructed to provide more memory resources. It includes a centralized cache (L2 cache) and a cache controller for the processor elements. The cache controller supports dynamic cache re-organization so that different cache resources can be allocated to different processor elements. The c-MMU also contains a DRAM controller for efficient access to off-chip DRAM. In the DRAM controller, the address translator rearranges and translates addresses for efficient memory allocation, and memory requests enter the memory interface with command scheduling to reduce memory access latency. The c-MMU is described in detail in Chapter 4.

Fig. 3.16 On-Demand Memory System architecture

Chapter 4 Hierarchy Memory Management Units for On-Demand Memory System

This chapter describes the design of the distributed memory management unit (d-MMU) and the centralized memory management unit (c-MMU) of the on-demand memory system in Section 4.1 and Section 4.2, respectively.

4.1 Distributed Memory Management Unit Organization

[Figure: a local node comprising a processor element, a wrapper, a network interface and a distributed MMU (distributed L1 cache and cache control), connected to the centralized MMU through the on-chip interconnection network]

Fig. 4.1 Block diagram of a local node

A local node in the memory-centric on-chip data communication platform consists of a distributed memory management unit (d-MMU), a network interface (NI), a wrapper and a processor element (PE). The block diagram is shown in Fig. 4.1. To provide local memory resources for each processor element, an efficient d-MMU processes the memory requests. The d-MMU includes a distributed cache (L1 cache) and a cache controller. Additionally, the NI is designed as a bridge between the processor element and the on-chip interconnection network (OCIN). When the packet buffer in the NI is crowded, unused cache blocks can be borrowed to buffer the blocked packets from the PEs. This section describes the design of the d-MMU with the buffer borrowing mechanism.

4.1.1 Design of d-MMU

For the memory-centric on-chip data communication platform, d-MMUs are designed for PEs to store the temporary data of their tasks. The distributed cache acts as a high-level cache for its dedicated PE in the on-demand memory system. In addition, a wrapper serves as the interface between the processor element and the d-MMU. In the on-demand memory system, a PE accesses memory through a burst-based memory access protocol. Under this protocol, read and write operations use burst transmission so that contiguous data can be accessed easily. The memory access operation is described in detail below.

4.1.1.1 Memory access operation

With the burst-based memory access protocol, the read and write operations proceed as shown in Fig. 4.2 and Fig. 4.3, respectively. By providing a start address and a burst length (BL), processor elements can efficiently access burst data in memory. Note that the data width is 32 bits (1 word) and that addresses are in units of words by definition. The burst-based protocol also allows cache miss penalties to be hidden, because a cache miss is discovered as soon as a memory burst request is served. Fig. 4.4 illustrates how miss penalties are hidden. In Fig. 4.4(a), a read request that misses follows a read request that hits. The miss penalty can be hidden because the data transfer of the first read has not yet finished. For a memory write request, shown in Fig. 4.4(b), all misses in the burst are found as soon as the write request arrives, so the miss processing latency of the write can also be hidden.

Fig. 4.2 Illustration of read operation

Fig. 4.3 Illustration of write operation

The maximum burst length in the pre-defined protocol is eight. So that the d-MMU can check immediately whether an incoming memory burst request misses, two cache banks with a block size of 32 bytes (8 words) are allocated in the d-MMU. With this allocation, a memory burst request references either one cache line in one bank or two cache lines in different banks, so cache hit/miss detection finishes in one cycle. The cache address mapping is illustrated in Fig. 4.7. Note that the illustration uses a 32-Kbyte cache size and 4-way associativity in each bank.
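The two-bank argument can be checked with a short sketch. It assumes that consecutive 8-word cache lines alternate between the two banks (line interleaving); the exact bit assignment of the address mapping is not stated here, so the function name `lines_touched` and this interleaving are illustrative assumptions.

```python
WORDS_PER_LINE = 8   # 32-byte block = 8 words of 32 bits
NUM_BANKS = 2
MAX_BURST = 8        # maximum burst length of the protocol

def lines_touched(start_addr: int, burst_len: int):
    """Return the (bank, line index within bank) pairs a burst references.

    start_addr is a word address. Assumes consecutive cache lines
    alternate between bank 0 and bank 1 (an assumption of this sketch).
    """
    if not 1 <= burst_len <= MAX_BURST:
        raise ValueError("burst length must be between 1 and 8")
    first_line = start_addr // WORDS_PER_LINE
    last_line = (start_addr + burst_len - 1) // WORDS_PER_LINE
    return [(line % NUM_BANKS, line // NUM_BANKS)
            for line in range(first_line, last_line + 1)]
```

Because a burst is never longer than one line, it spans at most two consecutive lines, and under the interleaving assumption those two lines always fall in different banks, so both tag lookups can proceed in the same cycle.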

Fig. 4.4 Illustration of hiding miss penalty

Additionally, the NI is designed as a bridge between the PEs and the OCIN [4.1]-[4.4]. The NI contains an input queue and an output queue for buffering packets. However, the sizes of these queues dominate the area and the performance. If the buffer is insufficient, the PE stalls until the head-of-line blocking is released. Therefore, if the utilization of the distributed memory is low, the d-MMU can lend memory resources to buffer the blocked packets from the PEs, and the PEs can keep computing their tasks. The d-MMU with the buffer borrowing mechanism is introduced in detail below.

4.1.1.2 Buffer Borrowing Mechanism

The architecture of the proposed d-MMU and efficient network interface with the buffer borrowing mechanism is shown in Fig. 4.5. The NI uses a buffering control to issue a borrowing request to the d-MMU. The d-MMU then checks the valid table and generates the borrowing address for the NI. Fig. 4.6 presents the buffer borrowing interface between the NI and the d-MMU. The buffer borrowing operations are write, read and release. For a write operation, the buffering control first sends a buffer request to the d-MMU and sends the blocked data only after it receives a grant signal. However, the head-of-line blocking may be released while the NI is waiting for the grant from the d-MMU or preparing the data. In that case, a release operation frees the extended memory resources. The borrowing address generator and the buffering control are described in detail below.

[Figure: processor element and wrapper; distributed MMU with cache, cache control, valid table and borrowing address generator; network interface with packetization, flow control, arbiter, FIFO and buffering control; borrowing mechanism between the NI and the d-MMU; connections to the centralized MMU and the on-chip interconnection architecture]

Fig. 4.5 d-MMU and efficient Network Interface

[Figure: interface signals between the NI and the d-MMU, including N_BUF_REQ]

Fig. 4.6 Buffer borrowing interface between NI and d-MMU

4.1.1.2.1 Borrowing Address Generator

When the NI requests an extended buffer to store a blocked packet, the borrowing address generator searches for an empty space in the distributed memory by checking the valid table. This valid table is attached to the cache tables, as shown in Fig. 4.7. The distributed memories are divided into two banks with four-way associativity. The memories corresponding to the last associative table in bank 0 and bank 1 are used less frequently than the others. Therefore, the d-MMU can borrow the empty spaces corresponding to this table. Moreover, each cache set in the four-way organization contains 4x8 words, so the maximum payload of a packet can be stored in one memory block (8 words) in one cycle. If a memory block is borrowed, the d-MMU asserts the status bit, which marks the block as holding borrowed data. Depending on the

[Figure: word address from the processor element (field labels 10, 1, 3); valid table with valid and dirty bits; status bit indicating that the data is buffering data; mask; hit detector; 4-way associativity]

Fig. 4.7 Borrowing mechanism in d-MMU

[Figure: valid bits feeding a 128-bit search window; search counter and MUX; empty detector producing the buffer borrowing address (word address, field labels 7 and 2) and an "all full" flag]

Fig. 4.8 Architecture of the empty memory block searching

After the NI sends a borrowing request to the d-MMU, it takes 2 to 8 cycles to collect the payload. Packets contain at most 8 flits in their payloads, and the average payload size is about 4 words. Therefore, the d-MMU has to find an empty memory block within 4 cycles. Additionally, the last associative tables in bank 0 and bank 1 contain 512 valid bits. To search for an empty memory block, a 128-bit search window is adopted. Fig. 4.8 shows the architecture of the empty memory block search. The search window is controlled by a search counter. The empty detector detects an empty memory block and generates the borrowing address together with the search counter. If all memory blocks in a search window are full, the window moves to the next 128 bits. Fig. 4.9 shows the search flow chart of the borrowing mechanism. The flow can be divided into three steps: empty memory block searching, borrowing status setting, and data writing. The first two steps are described above.

While data are written to the borrowed memory block, the borrowing address is stored in the address queue for later read operations. After the payload has been written into the memory block, the grant signal is reset to 0 for the next borrowing request.
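The windowed search can be modeled in a few lines. The sketch below assumes that one 128-bit window is examined per cycle, that a valid bit of 1 means the block is occupied, and that the lowest-numbered free block in a window is chosen; these details and the name `find_empty_block` are assumptions of the example rather than properties stated for the hardware.

```python
WINDOW_BITS = 128   # width of the search window
TOTAL_BITS = 512    # valid bits of the borrowable tables

def find_empty_block(valid_bits: int, start_window: int = 0):
    """Search the valid bits window by window for an empty block.

    valid_bits is an integer bitmask: bit i set means block i is occupied.
    Returns (block_index, windows_examined), or (None, windows_examined)
    when every block is full.
    """
    num_windows = TOTAL_BITS // WINDOW_BITS
    mask = (1 << WINDOW_BITS) - 1
    for step in range(num_windows):
        counter = (start_window + step) % num_windows   # search counter
        window = (valid_bits >> (counter * WINDOW_BITS)) & mask
        free = ~window & mask                           # empty detector
        if free:
            pos = (free & -free).bit_length() - 1       # lowest free bit
            return counter * WINDOW_BITS + pos, step + 1
    return None, num_windows
```

With 512 valid bits and a 128-bit window there are four windows, so under the one-window-per-cycle assumption the worst-case search takes four cycles, which matches the 4-cycle budget given above.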

[Figure: flow chart with three steps: empty memory block searching; borrowing status setting; data writing (put the borrowing address in the address queue and write the buffer data to cache; Grant = 0)]

Fig. 4.9 Searching flow chart of the borrowing mechanism in d-MMU

4.1.1.2.2 Buffering Control

The buffering control in the NI monitors the free space of the output queue and issues the borrowing operations to the d-MMU. Fig. 4.10 shows the block diagram of the borrowing mechanism in the buffering control. The buffering control issues write, read and release operations depending on an empty pointer of the output queue and a borrowing pointer of the borrowing header queue. The empty pointer and the borrowing pointer indicate the number of occupied buffers in the output queue and in the borrowing header queue, respectively. In addition, the write control contains a payload queue that collects the payload before it is written to the borrowed memory block.

The borrowing control policy of the buffering control is shown in Fig. 4.11. The borrowing mode indicates whether blocked data are currently stored in the d-MMU. In borrowing mode, data received from the PE must therefore be stored in the d-MMU. Otherwise, the data can be stored in the output queue when the number of empty slots is larger than the payload. While the NI is waiting for the borrowing grant from the d-MMU and collecting the payload, the head-of-line blocking may be released. The borrowing request can then be released if the borrowing mode equals zero. The release signal interrupts the search operation of the d-MMU.
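The policy above can be summarized as a small decision function. This is a hedged reading of the text, not the control logic of Fig. 4.11: the function names, the string results and the exact release condition are assumptions introduced for illustration.

```python
def choose_destination(borrowing_mode: bool, empty_slots: int,
                       payload_len: int) -> str:
    """Decide where a payload arriving from the PE is stored.

    borrowing_mode: True while blocked data are held in the d-MMU.
    empty_slots:    free slots in the NI output queue (in flits).
    payload_len:    size of the collected payload (in flits).
    """
    if borrowing_mode:
        # Earlier data already sit in the d-MMU, so new data follow them.
        return "d-MMU"
    if empty_slots > payload_len:
        # The text says "larger than", so a strict comparison is used.
        return "output_queue"
    # Not enough room in the output queue: ask the d-MMU for a block.
    return "d-MMU"

def should_release(borrowing_mode: bool, empty_slots: int,
                   payload_len: int) -> bool:
    """A pending borrowing request is dropped when the blocking clears
    before any data have been stored in the d-MMU (assumed condition)."""
    return (not borrowing_mode) and empty_slots > payload_len
```

Routing new data to the d-MMU while borrowing mode is set is assumed here to keep the data behind the packets that are already buffered there.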

Fig. 4.10 Block diagrams of borrowing mechanism in network interface

Fig. 4.11 Borrowing control policy of the buffering control

4.1.1.2.3 Simulation Results of Buffer Borrowing Mechanism

The proposed d-MMU, NI and memory-centric OCIN are implemented in SystemC for cycle-driven simulation. The simulation environment is set up as a 4x4 router with 4 PEs to evaluate the performance improvement obtained with the efficient NIs. Fig. 4.12(a) shows the execution time for transferring 200000 packets under various injection loads and queue sizes. As the injection load increases, the execution time decreases because the number of transferred packets is fixed. Additionally, Fig. 4.12(b) shows the number of packets transferred in 300000 cycles under various injection loads and queue sizes. The simulation results show that the proposed borrowing mechanism achieves similar performance across the different queue sizes. Moreover, the proposed efficient NI achieves about a 1.15x performance improvement over the conventional NI with a 16-flit queue.

[Figure: (a) execution time (x10^6 cycles) and (b) number of transferred packets (x10^4) versus queue size in the network interface (16 to 64 flits); annotations: 1.13x, injection load = 0.15]

Fig. 4.12 (a) Execution time under various injection loads and queue sizes (b) Transferred packets under various injection loads and queue sizes.

4.2 Centralized Memory Management Unit Organization

The distributed memory resources may be insufficient for the PEs. A lower-level cache is therefore applied to provide larger on-chip memory resources. The centralized memory management unit (c-MMU) includes a centralized cache and a cache controller. According to the distinct memory resource requirements of different PEs, the proposed c-MMU can allocate different cache resources to each PE. In addition, external memory is required for storing large data such as video frames in video processing. A DRAM controller is constructed in the c-MMU to access the DRAM device. The overall c-MMU architecture with adaptive cache control and the DRAM controller is introduced in the following sections.

4.2.1 Design of c-MMU

A simplified block diagram of the c-MMU is shown in Fig. 4.13. It consists of an adaptive cache controller, switches, several SRAM sub-blocks and a DRAM controller. The adaptive cache controller accepts the memory requests from the d-MMUs. Requests issued by different d-MMUs can be executed simultaneously if the memory resources they use do not conflict. The cache controller checks the selected cache tables to determine whether the data are in the cache. According to the result, the corresponding data and addresses are forwarded by the switch to the SRAM sub-block or to the DRAM controller. For read requests, the read data are forwarded to the output switch and sent back to the d-MMUs. In addition, the address translator and the external memory interface are constructed for efficient access to the external memory.

Fig. 4.13 c-MMU block diagram

Different applications may have different memory resource requirements. Even a single application may show varying memory behavior at runtime. The proposed c-MMU can dynamically adjust and allocate suitable memory resources to each processor element. The concept of adaptive memory resource allocation is shown in Fig. 4.14. Based on the different memory requirements of the processor elements, unequal memory resources are allocated. The adaptive cache control scheme is described in detail below.

Fig. 4.14 Concept of the adaptive memory resource allocation

4.2.1.1 Adaptive cache control

In our work, the cache size is adjusted on the principle of selective cache ways, which was proposed in [4.5]. By selecting a different number of ways, a different cache size can be assigned to each processor element. It is a simple method with little area and timing overhead for cache reconfiguration. In the proposed c-MMU organization, an associativity-based partitioning scheme is applied for the cache partition. Each SRAM sub-block represents a way and forms a bank of the cache organization. If the c-MMU contains N SRAM sub-blocks, the centralized cache has a capacity of N-way associativity. The SRAM blocks can be divided into several groups, one for each processor element. Fig. 4.15 shows an example of SRAM bank partitioning. Assuming the system has X processor elements and the c-MMU has N SRAM banks, the memory partition can be achieved as illustrated in Fig. 4.15.

Fig.4. 15 Illustration of the memory partition

To allocate memory resources dynamically to different processor elements at runtime, a Bank Assignment Table (BAT) is applied to record the memory usage information of three time intervals. Fig. 4.16 illustrates the cache table checking method when a request is served. According to the node ID of the requesting processor element, the cache controller searches the BAT and returns the assigned bank numbers. These bank numbers indicate which bank tables need to be checked for the request. Fig. 4.16 shows an example in which four banks are assigned to node 3 in the first time interval. When a request from node 3 is served, the tables of Bank0, Bank1, Bank2 and Bank3 are selected for hit checking. With this configuration, node 3 owns a 4-way associative L2 cache memory resource for processing.
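The BAT lookup followed by the selective hit check can be sketched as follows. The table layout, the class and function names, and the representation of the bank tables as lists of tags are assumptions made for this example; only the idea of checking just the banks assigned to the requesting node comes from the text.

```python
NUM_INTERVALS = 3  # the BAT records three time intervals

class BankAssignmentTable:
    """Maps (time interval, node ID) to the list of assigned bank numbers."""

    def __init__(self):
        self.table = [dict() for _ in range(NUM_INTERVALS)]

    def assign(self, interval: int, node: int, banks):
        self.table[interval][node] = list(banks)

    def banks_for(self, interval: int, node: int):
        return self.table[interval].get(node, [])

def lookup(bat, bank_tags, interval: int, node: int, index: int, tag: int):
    """Check only the bank tables assigned to the node.

    bank_tags[bank][index] holds the stored tag, or None if the entry
    is invalid. Returns the bank number that hits, or None on a miss.
    """
    for bank in bat.banks_for(interval, node):
        if bank_tags[bank][index] == tag:
            return bank
    return None
```

With banks 0 to 3 assigned to node 3 in the first interval, a request from node 3 is compared against four tags, which gives that node a 4-way associative view of the centralized cache; a matching tag in an unassigned bank is ignored.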

Fig. 4.16 Illustration of the cache table checking

In a multi-task system, multiple memory requests from different processor elements can be served simultaneously in the c-MMU, because the tables checked for different nodes are generally independent. Fig. 4.17 illustrates the checking of multiple requests. The target bank tables are selected according to the BAT information, and the check functions operate independently.
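A simple way to express this independence is to test whether the bank sets of the requesting nodes are disjoint. The sketch below uses a plain dictionary in place of the BAT, and the function name is hypothetical.

```python
def can_serve_in_parallel(assignments, nodes) -> bool:
    """Return True if the requesting nodes use pairwise disjoint banks.

    assignments maps a node ID to the bank numbers assigned to it
    (a stand-in for the BAT contents of the current time interval).
    """
    used = set()
    for node in nodes:
        banks = set(assignments[node])
        if used & banks:
            return False   # two nodes would check the same bank table
        used |= banks
    return True
```

For example, with node 3 on banks 0 to 3, node 2 on banks 6 and 7, and node 1 on banks 4 and 5, all three requests can be checked in the same cycle under this model.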

[Figure: requests from Node 3, Node 2 and Node 1, each carrying a node ID and a TAG, checked against their target bank tables]

Fig. 4.17 Illustration of checking multiple requests

The processor elements may show different memory behavior in different time intervals at runtime. The BAT can record the configuration for each time interval. It is updated by the processor element that profiles the memory requirements of the system. For wireless video entertainment systems, the effective bandwidth of the channel can be detected by the MAC. According to the detected state of the wireless channel, the transmitter can determine the scalable level of the SVC bitstream that fits the effective bandwidth. Because the bitstreams differ, the memory requirements of the different quality levels also differ and can be profiled off-line. In view of these,
